├── .gitignore
├── EuroSys2020
│   ├── AlloX.md
│   ├── Mem-disaggragate.md
│   ├── delegation-sketch.md
│   └── figs
│       └── bipartite-graph.png
├── IdeaPhilosophy.md
├── LICENSE
├── NSDI2017
│   ├── 2layer.png
│   ├── APUNet.md
│   ├── ExCamera.md
│   ├── Gaia.md
│   ├── Infiniswap.md
│   ├── LetItFlow.md
│   ├── Pytheas.md
│   ├── Tux.md
│   └── vCorfu.md
├── OSDI2014
│   ├── Arrakis.md
│   ├── Barrelfish.md
│   ├── IX.md
│   ├── Jitk.md
│   ├── willow.md
│   └── xxx.png
├── OSDI2016
│   ├── CloudSys1
│   │   ├── Firmament(quincy).md
│   │   ├── Morpheus.md
│   │   ├── altruistic.md
│   │   ├── graphene.md
│   │   ├── graphene_eg.png
│   │   └── note.md
│   ├── FaultTolerance Consensus
│   │   ├── 1rtt.png
│   │   ├── 2rtt.png
│   │   ├── Janus.md
│   │   ├── NOPaxos.md
│   │   ├── Olive.md
│   │   ├── XFT.md
│   │   ├── multicast.png
│   │   └── paxos.png
│   ├── Networking
│   │   ├── ERA.md
│   │   ├── NetBricks.md
│   │   ├── PathDump.md
│   │   └── disaggregateDatacenter.md
│   ├── OS1
│   │   ├── LWC.md
│   │   ├── Ratchet.md
│   │   ├── Smelt.md
│   │   ├── Yggdrasil.md
│   │   ├── latency-matrix.png
│   │   ├── multicore-message.png
│   │   └── tree-gen.png
│   ├── OS2
│   │   ├── CertiKOS.md
│   │   ├── EbbRT.md
│   │   ├── Ingens.md
│   │   ├── SCONE.md
│   │   └── ingen.png
│   ├── Potpourri
│   │   ├── Clarinet.md
│   │   ├── EC-Cache.md
│   │   ├── JetStream.md
│   │   ├── To Waffinity and Beyond.md
│   │   ├── local.png
│   │   └── local2.png
│   ├── Transaction_Storage
│   │   ├── Fasst.md
│   │   ├── ICG.md
│   │   ├── Pace.md
│   │   ├── latency.png
│   │   └── snow.md
│   └── cloud system2
│       ├── DQBarge.md
│       ├── diamond.md
│       ├── diamond.png
│       ├── history.md
│       ├── history.png
│       └── slicer.md
├── README.md
├── SOSP2005
│   └── speculator.md
└── gpu.md

/.gitignore:
--------------------------------------------------------------------------------
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2

## Intermediate documents:
*.dvi
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf

## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.brf
*.run.xml

## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync

## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa

# achemso
acs-*.bib

# amsthm
*.thm

# beamer
*.nav
*.snm
*.vrb

# cprotect
*.cpt

# fixme
*.lox

#(r)(e)ledmac/(r)(e)ledpar
*.end
*.?end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R

# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
*.glsdefs

# gnuplottex
*-gnuplottex-*

# hyperref
*.brf

# knitr
*-concordance.tex
# TODO Comment the next line if you want to keep your tikz graphics files
*.tikz
*-tikzDictionary

# listings
*.lol

# makeidx
*.idx
*.ilg
*.ind
*.ist

# minitoc
*.maf
*.mlf
*.mlt
*.mtc
*.mtc[0-9]
*.mtc[1-9][0-9]

# minted
_minted*
*.pyg

# morewrites
*.mw

# mylatexformat
*.fmt

# nomencl
*.nlo

# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd

# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/

# pdfcomment
*.upa
*.upb

# pythontex
*.pytxcode
pythontex-files-*/

# thmtools
*.loe

# TikZ & PGF
*.dpth
*.md5
*.auxlock

# todonotes
*.tdo

# xindy
*.xdy

# xypic precompiled matrices
*.xyc

# endfloat
*.ttt
*.fff

# Latexian
TSWLatexianTemp*

## Editors:
# WinEdt
*.bak
*.sav

# Texpad
.texpadtmp

# Kile
*.backup

# KBibTeX
*~[0-9]*

--------------------------------------------------------------------------------
/EuroSys2020/AlloX.md:
--------------------------------------------------------------------------------
# AlloX: Compute Allocation in Hybrid Clusters

Optimization: min-cost bipartite matching.

![](./figs/bipartite-graph.png)

Left: jobs. Right: compute resources (GPU, CPU, etc.)

--------------------------------------------------------------------------------
/EuroSys2020/Mem-disaggragate.md:
--------------------------------------------------------------------------------
# Can Far Memory Improve Job Throughput?

Interesting paper; it tries to use RDMA to achieve fast and usable memory disaggregation, evaluated in simulations.

Highlights:

* Builds two RDMA fetch queues: one for critical-path fetches, one for prefetches.

* Offloads memory reclamation to a dedicated CPU, taking it off the critical path; keeps an [alpha] buffer for warming up memory reclamation.

--------------------------------------------------------------------------------
/EuroSys2020/delegation-sketch.md:
--------------------------------------------------------------------------------
# Delegation Sketch: a Parallel Design with Support for Fast and Accurate Concurrent Operations

The count-min sketch in Sec 2.1 seems really useful.

Did not go through the whole paper. The problem to solve is achieving efficient scaling when concurrent queries and inserts occur together.

--------------------------------------------------------------------------------
/EuroSys2020/figs/bipartite-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/EuroSys2020/figs/bipartite-graph.png

--------------------------------------------------------------------------------
/IdeaPhilosophy.md:
--------------------------------------------------------------------------------
# Philosophy

When one designs a model or a system, it contains several assumptions or design choices that are not good enough.
We can step backward and revisit them with more freedom and fewer restrictions.

Design systems without a specific model.

Give designs more dimensions of freedom.

We can leverage more attributes or characteristics of users' applications.

Machine learning / deep learning can only do/simulate what people can do.

It is questionable whether machine learning could predict something that people cannot, like forecasting the stock market.

### On-heap vs. off-heap in the JVM
https://dzone.com/articles/heap-vs-heap-memory-usage

Detach the control plane from the data plane (e.g. Drizzle by Shivaram).



## Confidence is the key to success.
Everyone works differently. Figure out how you like to work and what makes you most productive.

(Do you work best in the early mornings or late evening? Do you like working with others or do you prefer to work by yourself? Do you work best when you have multiple projects or just one?)

It's a good idea to discuss these preferences with your advisor so they understand you better and can work with you as effectively as possible.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity.
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

--------------------------------------------------------------------------------
/NSDI2017/2layer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/NSDI2017/2layer.png

--------------------------------------------------------------------------------
/NSDI2017/APUNet.md:
--------------------------------------------------------------------------------
# APUNet: Revitalizing GPU as Packet Processing Accelerator

The APU reduces memory copies between GPU and CPU (because the GPU is integrated with the CPU).

1. Reduce redundant GPU launch and tear-down overhead => GPU threads run persistently. Data synchronization => GPU thread groups: one thread in the group processes and puts data into the L2 cache, and the thread group uses dummy memory I/O to force the GPU L2 cache to push data to main memory.

2. Zero-copy: traditionally, data is copied (NIC and CPU using mmap) into GPU memory. With mmap into shared memory there is still a copy; they instead receive data directly and keep it in shared memory.

BTW, it is the APU that removes the memory copy overhead.

--------------------------------------------------------------------------------
/NSDI2017/ExCamera.md:
--------------------------------------------------------------------------------
# Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads

## AWS shortcomings:

Starting nodes is costly if you run 4,000 threads for only 1 second (you pay for 4,000 threads (400 nodes) at an hourly rate).

Slow start time.

## AWS Lambda

AWS Lambda can launch 4,000 threads within a second, and costs less money.

--------------------------------------------------------------------------------
/NSDI2017/Gaia.md:
--------------------------------------------------------------------------------
# Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds


## Problem

--------------------------------------------------------------------------------
/NSDI2017/Infiniswap.md:
--------------------------------------------------------------------------------
# Efficient Memory Disaggregation with Infiniswap

## How to reduce remote memory allocation metadata

Use slabs instead of paging, which is a coarser granularity.


## How to allocate remote memory in a distributed way

Power of two choices.

## How to do memory eviction

Select x out of x+r slabs.

--------------------------------------------------------------------------------
/NSDI2017/LetItFlow.md:
--------------------------------------------------------------------------------
# Let It Flow: Resilient Asymmetric Load Balancing with Flowlet Switching

## Problem

Asymmetric network conditions: channels A and B have dynamic capacity.

How to adjust the transmission rate on each channel dynamically?

## Advantage

No feedback information needed. Simple.

Based only on estimated RTT, flowlets are transmitted with different inter-flowlet time gaps. The longer the RTT, the longer the gap => fewer packet transmissions.

--------------------------------------------------------------------------------
/NSDI2017/Pytheas.md:
--------------------------------------------------------------------------------
# Pytheas: Enabling Data-Driven Quality of Experience Optimization Using Group-Based Exploration-Exploitation

![](2layer.png)

Uses frontend servers to make decisions locally (fast), and backend servers to make decisions globally (with latency).

--------------------------------------------------------------------------------
/NSDI2017/Tux.md:
--------------------------------------------------------------------------------
# Tux: Distributed Graph Computation for Machine Learning

1. SSP (stale synchronous parallel) processing.

2. Heterogeneous objects (i.e. vertices) for data compression & less network traffic...

It is a good insight. For customer recommendation, items are high-degree vertices, whereas users are low-degree vertices. Therefore, scanning items reduces random memory access.

3. Mini-batches in graph modeling.

--------------------------------------------------------------------------------
/NSDI2017/vCorfu.md:
--------------------------------------------------------------------------------
# vCorfu: A Cloud-Scale Object Store on a Shared Log

## Shared log shortcomings:

1. Clients can't read the latest data (because the log only stores incremental updates) => playback is the bottleneck.

2.


## vCorfu

Builds a two-layer architecture of streams under a global log.

The vCorfu sequencer tracks not only the tail of the global log, but also the tail of each stream.

A client gets two addresses: 1) a global address, 2) a stream address.

The client writes to the log replica using the global address, and writes to the stream replica using the stream ID and stream address.

### Overhead

More commit logging (since we need to write not only to the global log, but also to the stream log), but it pays off.

### Benefit

Now we can access the log on only one replica (the stream replica), instead of the multiple replicas belonging to the global log servers.

--------------------------------------------------------------------------------
/OSDI2014/Arrakis.md:
--------------------------------------------------------------------------------
Arrakis: The Operating System is the Control Plane

I/O hardware is fast enough.

I/O is slow because of OS overhead, e.g. multiplexing/de-multiplexing, isolation, I/O scheduling.

Arrakis skips the kernel and lets applications get data directly from the NIC.

![](xxx.png)

Notes:

Storage plane: persistent data structures; operations are immediately persistent on disk

=> no need for serialization.

--------------------------------------------------------------------------------
/OSDI2014/Barrelfish.md:
--------------------------------------------------------------------------------
# Decoupling Cores, Kernels, and Operating Systems

Traditionally, many cores share ONE kernel. That one kernel controls and schedules applications across the CPUs.

Now, decouple the OS/kernel to achieve one OS/kernel per CPU.
This allows the system to plug in a new CPU or remove a CPU without interfering with the other CPUs.

--------------------------------------------------------------------------------
/OSDI2014/IX.md:
--------------------------------------------------------------------------------
# IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Contribution 1:

Separation of the data plane and the control plane.


Contribution 2:

Execution pipeline between TX and RX.

Adaptive batching:

1. Single packet => no batching, just process the packet.

2. Packet queue build-up => maximum batching, to reduce system overhead.

--------------------------------------------------------------------------------
/OSDI2014/Jitk.md:
--------------------------------------------------------------------------------
# Jitk: A Trustworthy In-Kernel Interpreter Infrastructure
Runs untrusted user code in the kernel, with theorem proving.

Uses BPF (Berkeley Packet Filter) to implement an interpreter on x86.

JIT: translates BPF to x86 for in-kernel execution.

--------------------------------------------------------------------------------
/OSDI2014/willow.md:
--------------------------------------------------------------------------------
# Willow: A User-Programmable SSD

People specialize SSDs for different usages.

The authors want to provide a programmable interface between users and the SSD.

--------------------------------------------------------------------------------
/OSDI2014/xxx.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2014/xxx.png

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/Firmament(quincy).md:
--------------------------------------------------------------------------------
# Firmament: fast, centralized cluster scheduling at scale

## Problem
Centralized schedulers: high-quality task placement, but latency of seconds or minutes.

Distributed schedulers: high throughput, low latency, but poor placements.

Hybrid schedulers: centralized placement for long-running tasks, distributed placement for short tasks.

## What Firmament can do
* Centralized scheduler
* High placement quality (like Quincy)
* Sub-second task placement latency
* Copes with demanding situations (e.g. oversubscription)
## How they did it
* Relaxation [Relaxation Methods for Minimum Cost Ordinary and Generalized Network Flow Problems] of Quincy

1. Terminating min-cost max-flow (MCMF) early to find approximate solutions => works badly
2. Incremental optimization => acceptable
3. Problem-specific heuristics help

## Background
* Task-by-task placement
drawbacks: 1. committing one placement early restricts the choices for tasks still waiting; 2. limited opportunity to amortize work

* Batched placement (Quincy)
advantage: jointly considers task placements (optimal)

## Key finding
Even though relaxation has the highest computational complexity, it actually performs best, with the least latency.
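
## Toy example: placement as min-cost flow

To make the flow-based placement concrete, here is a minimal sketch of one Quincy-style scheduling round expressed as a min-cost flow problem, using networkx as the solver. The task/machine names and arc costs are made-up assumptions; Firmament's real flow network (unscheduled aggregators, rack nodes, etc.) is far richer.

```python
# Toy Quincy-style scheduling as min-cost flow (illustrative only).
# Each task supplies one unit of flow; machines drain it to a sink.
# Arc costs stand in for placement preference (e.g. data locality).
import networkx as nx

G = nx.DiGraph()
tasks = {"t0": {"m0": 2, "m1": 9}, "t1": {"m0": 8, "m1": 1}}
machines = ["m0", "m1"]

for t in tasks:
    G.add_node(t, demand=-1)            # each task supplies 1 unit of flow
G.add_node("sink", demand=len(tasks))   # the sink absorbs all placements
for t, prefs in tasks.items():
    for m, cost in prefs.items():
        G.add_edge(t, m, capacity=1, weight=cost)  # placement arcs
for m in machines:
    G.add_edge(m, "sink", capacity=1, weight=0)    # machine capacity = 1 task

flow = nx.min_cost_flow(G)  # one MCMF solve = one scheduling round
placement = {t: m for t in tasks for m, f in flow[t].items() if f == 1}
print(placement)  # e.g. {'t0': 'm0', 't1': 'm1'}
```

Rerunning the solve as tasks arrive or finish is exactly where incremental optimization pays off: most of the flow from the previous round stays valid.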

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/Morpheus.md:
--------------------------------------------------------------------------------
# Morpheus: Towards Automated SLOs for Enterprise Clusters

## Problem
Tries to make job performance more predictable, especially for periodic tasks (while achieving high cluster utilization efficiency; I don't buy it).
## Handles two causes of unpredictable performance
1. `Sharing-induced` performance variability, caused by inconsistent allocations of resources across job runs.

2. `Inherent` performance variability, caused by differences in source code, data size, etc.


## How they did it
1. Automatic inference: inferring SLOs and modeling job resource demands. How: derive the target SLO by analyzing historical execution traces.

2. Recurring reservation: reserve resources based on the derived SLO (handles `sharing-induced` variance).

3. Dynamic reprovisioning: monitor resource allocation. If progress is slower/faster than expected, the scheduler can automatically adjust the reservation (handles `inherent` variance).

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/altruistic.md:
--------------------------------------------------------------------------------
# Altruistic Scheduling in Multi-Resource Clusters
## Problem
Schedulers always prefer instantaneous fairness. However, instantaneous fairness does not result in noticeable long-term benefits.
## What they want
Ensure performance isolation, while achieving good JCT and cluster utilization.

*Altruistic*
`leftover`: the amount of resources that a job cannot fully utilize. The job offers it out.

Intra-job: since they have a DAG for each job, compute how much leftover each job can provide.
Inter-job: using the leftover, 1) schedule tasks closest to completion, 2) pack the remainder for cluster efficiency.


## What they need to prove
Altruism will not inflate JCT.

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/graphene.md:
--------------------------------------------------------------------------------
# Graphene: Packing and Dependency-aware Scheduling for Data-Parallel Clusters

## What Graphene does
1. After generating the job DAG, it identifies the troublesome tasks and groups the remaining, non-troublesome tasks.
2. The scheduler places the troublesome tasks first.

## Different from existing schedulers:
Existing: a task is scheduled after all its parents finish.

Graphene: identify troublesome tasks first and place them, then place the other tasks around the trouble.

## Example of different scheduling
Here are good examples showing that Graphene can perform better than existing scheduling schemes in both online and offline scenarios.

![](graphene_eg.png)

## How to use Graphene with multiple DAGs
1. Convert the offline schedule into a priority order on tasks.

2. For online scheduling, enforce the schedule priority along with heuristics (e.g. fairness, JCT, packing efficiency).

## Nugget
1. Offline: troublesome tasks first (for each DAG).

2. Online: enforce priority over tasks along with other heuristics, as in the sketch below.
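
## Toy sketch

A minimal sketch of the troublesome-first idea. The "troublesome" test below (long duration or high resource demand) is my own stand-in; Graphene's actual criteria and packing are more involved.

```python
# Toy "troublesome tasks first" ordering for one DAG (illustrative only).
# Each task: duration, memory demand, and parent tasks.
tasks = {
    "a": {"dur": 10, "mem": 0.8, "parents": []},
    "b": {"dur": 1,  "mem": 0.1, "parents": ["a"]},
    "c": {"dur": 9,  "mem": 0.2, "parents": ["a"]},
    "d": {"dur": 1,  "mem": 0.1, "parents": ["b", "c"]},
}

def troublesome(t):
    # Stand-in test: long-running or resource-hungry tasks cause fragmentation.
    return tasks[t]["dur"] >= 8 or tasks[t]["mem"] >= 0.5

# Offline: lay out troublesome tasks first, then fit the rest around them.
trouble = [t for t in tasks if troublesome(t)]
rest = [t for t in tasks if not troublesome(t)]
offline_order = trouble + rest

# Online: the offline order becomes a priority; dependencies still gate execution.
priority = {t: i for i, t in enumerate(offline_order)}

done, schedule = set(), []
while len(done) < len(tasks):
    ready = [t for t in tasks if t not in done
             and all(p in done for p in tasks[t]["parents"])]
    nxt = min(ready, key=priority.get)   # highest priority among ready tasks
    schedule.append(nxt)
    done.add(nxt)
print(schedule)  # ['a', 'c', 'b', 'd']: the long task 'c' jumps ahead of 'b'
```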

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/graphene_eg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/CloudSys1/graphene_eg.png

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/note.md:
--------------------------------------------------------------------------------
When one designs a model or a system, it contains several assumptions or design choices that are not good enough.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/1rtt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/1rtt.png

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/2rtt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/2rtt.png

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/Janus.md:
--------------------------------------------------------------------------------
# Janus: Consolidating Concurrency Control and Consensus for Commits under Conflicts

## Problem
Conventional fault-tolerant distributed transactions have a *concurrency control* layer on top of a Paxos *consensus protocol*.

Therefore, we need two rounds of cross-datacenter coordination: one for concurrency control, one for consensus. That means >= 2 RTTs.

## Janus

One RTT of coordination for both concurrency control & consensus. In addition, under contention between different transactions, Janus copes well by ensuring a deterministic execution re-ordering.

![](1rtt.png)

When contention on the same transaction happens, Janus ensures the commit is never aborted, though it may need 2 RTTs to replicate the transaction. This is shown in the following picture.
![](2rtt.png)


## Key takeaway
We should consider separating concerns that one mechanism handles together. TCP uses slow start and AIMD for both congestion control and fairness; Dina Katabi separated congestion control from fairness control (in XCP) to get better system performance.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/NOPaxos.md:
--------------------------------------------------------------------------------
# Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering

Paxos's overhead is the sequential copying of data (e.g. first to the leader replica, then the second replica, then the third). In an asynchronous network, this sequential copying introduces a long delay, as illustrated in the figure below.

![](paxos.png)

Therefore, to mitigate this replica-copying delay, the authors propose using *multicast* for fast `in-order` data transmission, while letting the hosts ensure data *reliability*, as shown in the picture below.

![](multicast.png)

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/Olive.md:
--------------------------------------------------------------------------------
# Realizing the fault-tolerance promise of cloud storage using locks with intent

## Problem
In cloud computation over a distributed file system:

Failures:
* The application can fail => the network reorders/drops messages to the underlying storage.

* Lower-level API => the current storage operation can fail.

We need data consistency over concurrent operations on cloud storage state, plus failures of the VMs running applications. The basic idea would be to do replication with Paxos, in order to achieve data consistency across VMs.

However, that wastes storage. They instead use locks at the storage layer to ensure fault tolerance.

## Solution
`Locks with intent` combine computation with storage operations, using locks.

`Intent` code contains cloud storage operations and local computation. It ensures that when an intent completes, each step in the intent executed *exactly once*.

`Lock` gives an `intent` exclusive access to the object, and unlocks when the client holding the lock crashes.

*Exactly once*: store intents with IDs. They introduce DAAL (distributed atomic affinity logging), which colocates the log entry for executing an intent step with the object changed by that step.

*Concurrent executions of the same intent* => an `intent collector` periodically completes unfinished intents.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/XFT.md:
--------------------------------------------------------------------------------
# XFT: Practical Fault Tolerance Beyond Crashes

## Problem
Current fault-tolerance schemes are CFT (crash fault-tolerant). Therefore, they cannot deal with non-crash faults, like malicious behavior.

BFT (Byzantine fault tolerance) can deal with non-crash faults. However, its overhead is too high compared with CFT.

## XFT (cross fault tolerance)

Does not need BFT's assumption that adversaries are that powerful.

Reduced latency: (out of 2t+1 total replicas) XPaxos synchronously replicates client requests to only the t+1 `active` replicas; the other `passive` replicas use a *lazy replication approach*.

Difference from CFT or BFT: generate views using all t+1 active replicas instead of just the primary replica, ensuring data consistency.
FD scheme (fault detection, like BFT): a non-crash faulty replica is not allowed to transfer its latest state to a correct replica in the new sync group. This eliminates data poisoning from the non-crash faulty replica.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/multicast.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/multicast.png

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/paxos.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/paxos.png

--------------------------------------------------------------------------------
/OSDI2016/Networking/ERA.md:
--------------------------------------------------------------------------------
# Efficient Network Reachability Analysis using a Succinct Control Plane Representation

## Problem

It is not easy to do network verification (e.g. whether A can talk to B or not).

Reason: arrival of new route announcements, link failures, etc.



## Previous work on reasoning about the control plane for network verification
Bagpipe, rcc, and ARC focus on a single routing protocol or a limited set of routing protocol features.

Batfish analyzes the entire control plane. But it does so by simulating the behavior of each routing protocol to compute the data plane, which is expensive.

## ERA
Efficient Reachability Analysis.

Reasons directly about the network control plane (no simulation of the data plane), and can scale.

1. Analyzes the control plane by modeling what each router learns from its neighbors.
2. Picks the best route when multiple routes to the same prefix are learned.

Makes a trade-off between expressiveness and abstraction level.

* A unified, protocol-invariant routing abstraction.

* A compact binary decision diagram encoding of the routers' control plane.

--------------------------------------------------------------------------------
/OSDI2016/Networking/NetBricks.md:
--------------------------------------------------------------------------------
# NetBricks: Taking the V out of NFV
## Problem
NFV (network function virtualization) is the new trend, replacing hardware-based middleboxes.

However, NFV introduces a lot of system overhead for isolation (e.g. VMs, IPC), which decreases performance (throughput, latency).

Current solution: BESS + containers, which is slow.

## Solution
NetBricks: single process, low cost.

Zero-copy soft isolation (ZCSI): uses unique types to eliminate packet copies, while preserving packet isolation.

Maximum performance comes from placing an entire NF chain on a single core.

--------------------------------------------------------------------------------
/OSDI2016/Networking/PathDump.md:
--------------------------------------------------------------------------------
# Simplifying Datacenter Network Debugging with PathDump
## Problem
Network debugging is complicated and needs to take all the components into consideration.

### Problems with existing works

HSA [NSDI'12], Anteater [SIGCOMM'11]: analysis based on static network snapshots => hard to get consistent network states.

NetSight [NSDI'14]: per-switch logging => high bandwidth and processing overhead.

Everflow [SIGCOMM'15], Planck [SIGCOMM'14]: selective packet sampling and mirroring => identifying which packets to sample for debugging problems is complex.

Pathquery [NSDI'16]: SQL-like queries on switches => needs to dynamically install switch rules.



## Nugget

Divide and conquer.

## How PathDump works

Tracing packet trajectories: embed the switch ID along the packet trajectory (using sampling to reduce the header space used).

On the end host: get the switch IDs from the packet header and record them in local storage. This storage is used for network debugging queries.

--------------------------------------------------------------------------------
/OSDI2016/Networking/disaggregateDatacenter.md:
--------------------------------------------------------------------------------
# Network Requirements for Resource Disaggregation
## Problem

Datacenter speed increases have slowed down. Big companies are trying the new disaggregated datacenter (DDC).

## Key findings for DDC
1. Network bandwidth in the range of 40-100 Gbps is sufficient to maintain application-level performance.
2. Network latency in the range of 3-5 us is needed to maintain application-level performance.
3. The bottleneck, or the root of the bandwidth & latency requirements, is the applications' memory bandwidth demand.
4. Disaggregation at datacenter scale is feasible.
5. TCP/DCTCP fails to meet the target requirements.

--------------------------------------------------------------------------------
/OSDI2016/OS1/LWC.md:
--------------------------------------------------------------------------------
# Light-Weight Contexts: An OS Abstraction for Safety and Performance
## Problem
Switching & communication between processes cost a lot (e.g. kernel scheduler, IPC, context switching).

## Nugget
Isolation within a process.

## What they do
Decouple memory and privileges from threads, and reuse threads.

*Light-weight context* (lwC): within a process, maintains isolation of virtual memory mappings, file descriptor bindings, etc.

lwCs are orthogonal to threads: a thread may start in lwC a, then switch to lwC b.

## Benefit
lwCs are not associated with threads. We can run untrusted code in one.

--------------------------------------------------------------------------------
/OSDI2016/OS1/Ratchet.md:
--------------------------------------------------------------------------------
# Intermittent Computation without Hardware Support or Programmer Intervention
## Problem
Intermittent computation => write-after-read (WAR) hazards.

## Solution
* Identify WARs
* Checkpointing
* Optimization => reduce the number of checkpoints; avoid WAR violations between caller and callee code.

## Strength
* Correcting WARs (caused by intermittence) without hardware modification or burdening the programmer.

## Problem
* How to handle checkpointing failure => duplicated checkpoint writes

(say they have two databases for writing the checkpointing results), ensuring atomic writes.

--------------------------------------------------------------------------------
/OSDI2016/OS1/Smelt.md:
--------------------------------------------------------------------------------
# Machine-aware Atomic Broadcast Trees for Multicores
## Background
### `SMP`: Symmetric Multi-Processor

All the cores are equal, and share all resources.
If multiple cores issue requests to RAM => access is serialized.

### `NUMA`: Non-Uniform Memory Access

SMP => limited scalability (CPUs :arrow_up: => memory-access conflicts :arrow_up:).

NUMA: each group of CPUs has shared local memory. All groups are connected by a crossbar.

Shortcoming: the latency of remote memory access is much longer than that of local memory access.

### `MPP`: Massively Parallel Processing
A group of SMPs.

Only local memory access within an SMP; no remote memory access among SMPs.

## Difference in messaging among nodes and cores
![](multicore-message.png)
## Problem
Generating a good tree structure for multi-core broadcasting is important.

Multi-core hardware is complex => hard to generate one tree that fits all machines.

## How to generate the tree
* Gather messaging latencies among cores
![](latency-matrix.png)

* Tree-generating heuristics
![](tree-gen.png)

--------------------------------------------------------------------------------
/OSDI2016/OS1/Yggdrasil.md:
--------------------------------------------------------------------------------
# Push-Button Verification of File Systems via Crash Refinement

## Problem

File system verification is hard to prove (it takes years of manual effort).

=> Yggdrasil produces a counterexample if there is a bug. No proof needed.

## How it works
`Crash refinement`: captures non-determinism bugs (e.g. reordering of writes).

`Crash refinement` is achieved with a *satisfiability modulo theories* (SMT) solver (Z3).

Specification => SMT solver.

To make the SMT solution scale => *layering*:

exhaust all execution paths within a layer, avoiding path explosion between layers.

## How to use it
Input: `specification`, `implementation`, `consistency invariants`.

Yggdrasil verifies it.

If (pass): compile.

else (fail): visualizer => counterexample.

## Key contributions

The basic idea is to check whether specification F_0 is equivalent to implementation F_1 (any-to-any mapping).

This is too strong (it does not account for non-determinism like reorderings and crashes).

`Crash refinement`: F_1 to F_0 (any-to-some mapping).
## Strengths
* Counterexample-based debugging is better than proofs.
* The specifications are succinct, expressive, and free of implementation details => reusable.
* Generalizes to applications that use the disk in other ways (e.g.
copy).

--------------------------------------------------------------------------------
/OSDI2016/OS1/latency-matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS1/latency-matrix.png

--------------------------------------------------------------------------------
/OSDI2016/OS1/multicore-message.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS1/multicore-message.png

--------------------------------------------------------------------------------
/OSDI2016/OS1/tree-gen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS1/tree-gen.png

--------------------------------------------------------------------------------
/OSDI2016/OS2/CertiKOS.md:
--------------------------------------------------------------------------------
# CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels

## Problem
There is no way of verifying the correctness of concurrent programs in an OS.

## Difficulties in concurrent OS kernel verification
* Interleaved executions are interdependent and hard to untangle.
* User, I/O, and multicore components work coherently with each other.
* Concurrent kernels do not guarantee that a concurrent system call will return.
* It should be easy to reuse.

## CertiKOS
Using existing certification methods, build certified abstraction layers, like L1, L2, etc.

To support concurrency, they parameterize each layer L with an active thread A, and denote EC(L,A) as its environment context.

For each EC(L,A) they use an "e" to capture a specific instance (e.g. one CPU).

In a nutshell, they use existing certification methods and design a layered structure; for each layer, an "e" unit describes the fine-grained environment context of one CPU or thread.

--------------------------------------------------------------------------------
/OSDI2016/OS2/EbbRT.md:
--------------------------------------------------------------------------------
# EbbRT: A Framework for Building Per-Application Library Operating Systems

## Problem:
We do not always need the OS to provide protection and isolation. Reason: a general-purpose operating system has overhead that hurts per-application performance.

## EbbRT
It is a set of components, Ebbs, that developers can use to assemble an application.

Pretty much like the SDN layers.

* Low-level event-driven execution environment
* Heterogeneous distributed system architecture
* Elastic building blocks

They show an example of how EbbRT can be more efficient than memcached on Linux.

--------------------------------------------------------------------------------
/OSDI2016/OS2/Ingens.md:
--------------------------------------------------------------------------------
# Coordinated and Efficient Huge Page Management with Ingens
## Problem

High address translation cost.

## Traditional solution:
Huge pages improve TLB coverage.

However, they bring high page-fault latency, memory bloating, etc.
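
A back-of-the-envelope illustration of why huge pages help TLB coverage; the 1536-entry TLB size below is an assumed example, not a number from the paper:

```python
# Rough TLB-coverage arithmetic (illustrative; 1536 entries is an assumed
# L2 TLB size, not a figure from the Ingens paper).
tlb_entries = 1536
base_page = 4 * 1024          # 4 KB base pages
huge_page = 2 * 1024 * 1024   # 2 MB huge pages (x86-64)

print(tlb_entries * base_page // 2**20, "MB covered with base pages")   # 6 MB
print(tlb_entries * huge_page // 2**30, "GB covered with huge pages")   # 3 GB
```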

## Ingens
![](ingen.png)

* Page-fault latency

Ingens uses asynchronous allocation: the page-fault handler only assigns a base page, and huge-page allocation happens in the background.

* Memory bloating

Ingens monitors the spatial utilization of each huge-page region, and enables utilization-based allocation.

## Design philosophy we can learn
* Reduce latency => push high-latency work into the background.
* Find more space => find the allocation unit and check whether it is fully used. If not, try to fully utilize it.

--------------------------------------------------------------------------------
/OSDI2016/OS2/SCONE.md:
--------------------------------------------------------------------------------
# SCONE: Secure Linux Container Environments with Intel SGX

## Problem
The cloud provider and its users do not trust each other. Previous works on cloud security are heavyweight and introduce high overhead.

## Design
* Enhanced C library => small TCB
* Asynchronous system calls & user-space threading => reduce the number of enclave exits
* Network & file system shields => actively protect user data

--------------------------------------------------------------------------------
/OSDI2016/OS2/ingen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS2/ingen.png

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/Clarinet.md:
--------------------------------------------------------------------------------
# Clarinet: WAN-Aware Optimization for Analytics Queries

## Problem:

In geo-distributed datacenters, it is hard to run analytics queries spanning all of a company's DCs around the world.

## Solution:

The key latency, they claim, is the network latency in the WAN. They leverage a concept from SDN: a logical-to-physical plan abstraction, which makes it easy to implement different policies in the logical-layer query optimizer.


Therefore, they do network-aware query placement and scheduling. For example, they try to first schedule queries that run between two DCs with higher network throughput, and schedule later the queries whose DC pairs share low bandwidth.

For iterative jobs, people previously used a shortest-job-first (SJF) scheduling algorithm to achieve low latency. The authors find that some links are under-utilized with SJF. Therefore, they enable more parallel data communication in the network to achieve overall low latency among all the DCs: identify the k shortest jobs instead of just 1, and schedule them in parallel.

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/EC-Cache.md:
--------------------------------------------------------------------------------
# EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding

Uses erasure coding, something like the spinal codes in the SIGCOMM paper, to cache `partial` data replicas in memory.
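
To make the idea concrete, here is a toy sketch of reading under erasure coding. EC-Cache itself uses proper (k, r) codes (Reed-Solomon style) and reads any k of k+r splits in parallel; this uses a single XOR parity split with k=2 purely for illustration:

```python
# Toy (k=2, r=1) erasure-coded cache read (illustrative only; EC-Cache
# uses real Reed-Solomon-style codes, and padding handling is omitted here).
def encode(value: bytes, k: int = 2):
    value += b"\x00" * (-len(value) % k)            # pad to a multiple of k
    size = len(value) // k
    splits = [value[i*size:(i+1)*size] for i in range(k)]
    parity = bytes(a ^ b for a, b in zip(*splits))  # XOR parity split
    return splits + [parity]

def decode(splits):
    # Any 2 of the 3 splits suffice; rebuild a missing data split via XOR.
    if splits[0] is None:
        return bytes(a ^ b for a, b in zip(splits[2], splits[1])) + splits[1]
    if splits[1] is None:
        return splits[0] + bytes(a ^ b for a, b in zip(splits[2], splits[0]))
    return splits[0] + splits[1]

splits = encode(b"hotobject!")
splits[0] = None          # one cache server is slow or down
print(decode(splits))     # b'hotobject!': the read completes at full speed
```

The load-balancing benefit falls out of the same property: a hot object's reads spread across k+r servers instead of hammering one replica.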

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/JetStream.md:
--------------------------------------------------------------------------------
# JetStream: Cluster-scale Parallelization of Information Flow Queries

## Dynamic information flow tracking (DIFT)
DIFT tracks execution causality; it is also known as taint tracking.

## Problem
Parallelizing DIFT is hard, because of sequential dependencies.

## Solution:
Time-slice execution into epochs.
![](local.png)

Here is a specific example of how it works.

![](local2.png)


By splitting execution into epochs, we can parallelize DIFT like a parallel stream-processing pipeline.

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/To Waffinity and Beyond.md:
--------------------------------------------------------------------------------
# To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

`Classical waffinity`: unable to achieve incremental parallelization.

Therefore, we get `hierarchical waffinity`: a hierarchy of aggregate data volumes, instead of only one top-level aggregate.

However, sometimes we need to access two different file blocks, and the traditional scheme can only ensure we access one block.

Therefore, we need `hybrid waffinity` to combine partitioning with fine-grained locking: 1) particular blocks are protected with locking from multiple affinities; 2) it continues to allow incremental development.

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/local.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/Potpourri/local.png

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/local2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/Potpourri/local2.png

--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/Fasst.md:
--------------------------------------------------------------------------------
# FaSST: Fast, Scalable and Simple Distributed Transactions with Two-sided (RDMA) Datagram RPCs

## Problem
Existing systems: one-sided RDMA. FaSST: two-sided RDMA.

For a hash-map read:

Two RTTs for data transmission: 1st, read the pointer; 2nd, read the value.

FaSST: one RTT for data transmission.

### Why one-sided RDMA is bad
Cannot use lock-free I/O.

One-sided RDMA needs a connection per pair (too many connections can be bad: NIC cache misses).

With one-sided connections, coordination is slower.

## Changes
RPC over RDMA. RPCs involve the remote CPU in message processing, rather than excluding it.

## Trade-off
FaSST is slower than 1 traditional RDMA read, but faster than 2 RDMA reads.
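
## Toy comparison

A minimal mock of the two designs' round-trip counts. Everything here is simulated in plain Python (no real RDMA verbs); the data layout and handler are made up for illustration:

```python
# Mock comparison: client-side pointer chasing (2 one-sided reads) vs.
# an RPC that runs the lookup on the server's CPU (1 round trip).
index = {"key1": 0x10}          # hash index: key -> "address" of the value
memory = {0x10: "value1"}       # server memory, addressed reads

rtts = 0

def one_sided_read(addr_space, addr):
    global rtts
    rtts += 1                   # each remote read costs one network RTT
    return addr_space[addr]

# One-sided design: the client chases the pointer itself => 2 RTTs.
rtts = 0
ptr = one_sided_read(index, "key1")
val = one_sided_read(memory, ptr)
print(val, "one-sided RTTs:", rtts)        # value1 one-sided RTTs: 2

# RPC design: the server CPU does the full lookup => 1 RTT.
def rpc_get(key):
    global rtts
    rtts += 1                   # one request/response round trip
    return memory[index[key]]   # pointer chase happens locally on the server

rtts = 0
print(rpc_get("key1"), "RPC RTTs:", rtts)  # value1 RPC RTTs: 1
```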
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/ICG.md:
--------------------------------------------------------------------------------
# Incremental Consistency Guarantees for Replicated Objects

## Problem
There is a dilemma between consistency and latency for geo-distributed data-replication queries.

If you want strong consistency, you get high latency & low availability.

If you want low latency, you get weak consistency & high availability.

Neither is good on its own, and a hybrid introduces difficulty for developers.

## What they do
![](latency.png)

Focus on the grey area: combining consistency models in a single operation.

The developer invokes an operation and gets multiple incremental views of the results.

The first view is the low-latency (weak-consistency) one, followed by higher-latency (better-consistency) views, and so on.

## Key takeaway

### Make discrete/binary things continuous.
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/Pace.md:
--------------------------------------------------------------------------------
# Correlated Crash Vulnerabilities
## Problem
Reliability is important in distributed systems.

Core mechanism: replication.

Replication can endure a single-machine crash.

The problem is correlated crashes (all data replicas crash at the same time) and recovering together.

## How they solve the problem
Determine cuts in the distributed execution.

Generate persistent states corresponding to those cuts.

Then prune the states to increase efficiency.

Recover based on the cuts.
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/latency.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/Transaction_Storage/latency.png
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/snow.md:
--------------------------------------------------------------------------------
# The SNOW Theorem and Latency-Optimal Read-Only Transactions
## Background
### Why sharding?

1. Sharding is horizontal partitioning of a database (splitting rows).
2. The partitions share the same table structure.

### Typical webpage read
When loading a webpage, multiple reads over many shards are issued (100s to 1000s) in parallel.

Some reads depend on earlier reads (e.g. read B uses the key returned by earlier read A).

"One-shot" read-only transactions have no key dependencies across shards.
## Problem
The SNOW theorem says there is no way to get both low latency and high consistency for read-only transactions.

## SNOW theorem
It is IMPOSSIBLE for a read-only transaction algorithm to provide all four desirable properties:
* Strict serializability
* Non-blocking operations
* One response from each shard
* Compatibility with conflicting Write transactions

## Key insight

SNOW is tight: any combination of 3 is possible.

To make reads as fast as possible, we need to shift as much coordination overhead as possible to writes.
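A toy sketch of this shift-work-to-writes idea (my illustration, not the actual protocols described next): writers install timestamped versions, so a read-only transaction is non-blocking and needs only one response per shard:

```python
class Shard:
    """Multiversioned shard: key -> list of (commit_ts, value)."""
    def __init__(self):
        self.versions = {}

    def write(self, key, value, commit_ts):
        # Writes pay the coordination cost: commit_ts is assumed to come
        # from some transaction-ordering protocol, not shown here.
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read_at(self, key, snapshot_ts):
        # Non-blocking, one response: newest version <= snapshot_ts.
        best = None
        for ts, value in self.versions.get(key, []):
            if ts <= snapshot_ts and (best is None or ts > best[0]):
                best = (ts, value)
        return best[1] if best else None

def read_only_txn(shards, keys, snapshot_ts):
    """One parallel, non-blocking request per involved shard."""
    return {k: shards[hash(k) % len(shards)].read_at(k, snapshot_ts)
            for k in keys}

shards = [Shard(), Shard()]
shards[hash("x") % 2].write("x", 1, commit_ts=10)
shards[hash("y") % 2].write("y", 2, commit_ts=12)
print(read_only_txn(shards, ["x", "y"], snapshot_ts=11))  # {'x': 1, 'y': None}
```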
## What they are doing
A slightly weaker consistency, `process-ordered` serializability, without breaking the other three properties.

`COPS-SNOW`: adds "O" to COPS (which only has "N").

`Rococo-SNOW`: adds "O" to Rococo (which has "S+W").
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/DQBarge.md:
--------------------------------------------------------------------------------
# DQBarge: Improving Data-Quality Tradeoffs in Large-Scale Internet Services

## Problem

* Time goal

Low-level components are unaware of the response-time goal and may miss it.

We want to make data-quality tradeoffs proactively. That is, the developer should not need to worry about the time boundary; the system can handle it systematically.

## DQBarge
1. Propagate critical information along the causal path of request processing.

2. In an offline stage, use the information gathered in step 1 to generate QoS models, and proactively determine which tradeoff to make.

3. Generate the QoS models using a small portion of the data.
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/diamond.md:
--------------------------------------------------------------------------------
# Diamond: Automating Data Management and Storage for Wide-area, Reactive Applications

ACID+R: R is reactivity.

## Diamond
Automated end-to-end data management and storage.
![](diamond.png)
* Uses a multi-layer cache
* Optimistic concurrency control for data
* Data push notifications
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/diamond.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/cloud system2/diamond.png
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/history.md:
--------------------------------------------------------------------------------
# History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters
![](history.png)

* Uses history to find the under-utilized parts of the cluster, and then uses them.

* Leverages data replicas to achieve data locality: uses a two-dimensional data-replica matrix to calculate the best replica positions.
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/history.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/cloud system2/history.png
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/slicer.md:
--------------------------------------------------------------------------------
# Slicer: Auto-Sharding for Datacenter Applications
## Problem
Sharding and load balancing are important for datacenter applications.

We could use local memory for caching data; however, datacenter apps are often not using such a cache.

## Slicer
Dynamically shards the workload, keeping state in RAM.

Uses a small portion of memory on each node to build a control plane. It has a centralized sharder on a central node and load-balances accordingly by transmitting messages across the Slicer control plane within a datacenter. A toy version of the assignment step is sketched below.
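A minimal sketch of centralized slice assignment (my greedy toy version; Slicer's real assigner also weighs reassignment churn, and the hashing details differ):

```python
import hashlib

NUM_SLICES = 64  # many more slices than servers, so load can be rebalanced

def slice_of(key: str) -> int:
    """Hash a key into one of the fixed slices."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SLICES

def assign_slices(slice_load: dict, servers: list) -> dict:
    """Greedy: place hottest slices first, each on the least-loaded server."""
    load = {s: 0.0 for s in servers}
    assignment = {}
    for sl in sorted(slice_load, key=slice_load.get, reverse=True):
        target = min(load, key=load.get)
        assignment[sl] = target
        load[target] += slice_load[sl]
    return assignment

print(slice_of("user:42"))  # which slice this key lives in
loads = {0: 9.0, 1: 1.0, 2: 5.0, 3: 4.0}
print(assign_slices(loads, ["serverA", "serverB"]))
# {0: 'serverA', 2: 'serverB', 3: 'serverB', 1: 'serverA'}
```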
## What we learn
Use a small portion of memory to cache important information.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SystemPaper


[![License](https://img.shields.io/badge/license-BSD-blue.svg)](LICENSE)

This is a paper-review repo of top-tier computer systems conferences.

* OSDI 2016
* SOSP 2015
* NSDI 2017
--------------------------------------------------------------------------------
/SOSP2005/speculator.md:
--------------------------------------------------------------------------------
# Speculative Execution in a Distributed File System
## Problem
Each asynchronous write (e.g. from client A) on the distributed file system needs two steps: write and commit.

That means two RTTs of latency, which we want to reduce.

If another client (client B) wants to open the modified file, it needs a getattr RPC (1 RTT). If the file was modified, client B must discard its cached copy and reload the newly modified data, which is time-consuming.

## How to achieve speculative execution
Client A asynchronously issues the write and commit RPCs and keeps them in flight, speculating that all modifications are up to date.

Client B asynchronously issues getattr, speculating that its cached copy of the file is up to date.
--------------------------------------------------------------------------------
/gpu.md:
--------------------------------------------------------------------------------

### Distributed GPU
Data parallelism: SIMD BSP (e.g. Spark/Hadoop)

Model parallelism: partition the model itself across workers.

### Poseidon (Eric Xing)

https://github.com/sailing-pmls/poseidon

Data transmission: partial data transmission (partial gradient Tx/Rx between client and server): one-layer Tx/Rx instead of transmitting all layers' data.

### Bosen (for network efficiency)

Network bounded.

* Hash table for partial output communication, which minimizes lock contention.

* BSP (bulk synchronous parallel) vs. TAP (totally asynchronous parallelization): for machine learning, we use BSC (bounded staleness consistency).


## GPU Net
A socket abstraction for the GPU to directly access the NIC.

Ring buffer: 1) for a queue with a fixed maximum size; 2) well-suited for FIFO, whereas a non-circular buffer suits LIFO.

## Adam (MSR)

* Model parallel and data parallel.

* Tries to partition data to fit in the L3 cache.

* Convolution layers: periodic checkpointing.

* Fully connected layers: send activation & error gradient vectors instead of the weight matrix.

## Explore bounded-staleness data in ML

* Reduce communication => use stale parameters and update the parameters after X rounds of iteration.


## Google TPU evaluation

https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html

## GeePS (EuroSys 16)

GPU/CPU data movement in the background (staging, pipelining), with partial data movement; a toy version of the overlap is sketched below.
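A minimal sketch of that background staging (toy stand-ins for the copy and the compute; GeePS itself manages real GPU buffers and parameter caches):

```python
import queue
import threading
import time

def copy_to_gpu(batch):
    time.sleep(0.01)          # stand-in for a host-to-device copy
    return batch

def compute(batch):
    time.sleep(0.02)          # stand-in for the forward/backward pass
    return sum(batch)

def train(batches, depth=2):
    staged = queue.Queue(maxsize=depth)  # bounded => bounded GPU memory

    def stager():                        # background staging thread
        for b in batches:
            staged.put(copy_to_gpu(b))   # overlaps with compute() below
        staged.put(None)                 # sentinel: no more batches

    threading.Thread(target=stager, daemon=True).start()
    while (batch := staged.get()) is not None:
        compute(batch)

train([[i, i + 1] for i in range(10)])
```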
Parameters: a shard of the parameters is maintained on each local machine; each machine also maintains a stale cache of the global parameters. (May use P2P communication.)


## Exploring iterative-ness
Leverages the repeated access pattern of shared data (i.e. parameters in ML).

Parameter server: no state-update conflicts; the weights can be updated in any order within a CLOCK.

## Parameter server (Mu Li)

*machine learning MapReduce parameter server*

## STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning

Improves the convergence speed of model-parallel ML at scale.

### Controlling parameter staleness
SSP (stale synchronous parallel): bookkeeping on the deviation between workers.

STRADS: controls pipeline depth.

STRADS: model parallel instead of data parallel.

Parameter server: couples the data and control planes, whereas STRADS decouples them, using either a ring structure or pipelined data processing over communication.


## Distributed GraphLab (VLDB 2012)

Serializability:

Full consistency: the update function has complete read-write access to its entire scope.

Edge consistency: the update function has read access to adjacent vertices.

Vertex consistency: write access to the central vertex's data.

### Coloring step to verify edge consistency

Coloring step (NP-hard, thus using a greedy method):

1. Edge consistency => assign a color to each vertex such that no adjacent vertices share the same color.

2. Full consistency => no vertex shares a color with any of its distance-two neighbors.

3. Vertex consistency => all vertices in one color.

After the coloring step, we can synchronously execute all vertices of the same color, as in the sketch below.
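A minimal greedy-coloring sketch (my illustration; GraphLab's engine implementation differs):

```python
def greedy_color(adj):
    """adj: vertex -> set of neighbors. Optimal coloring is NP-hard, so
    greedily give each vertex the smallest color its neighbors don't use."""
    color = {}
    for v in adj:                       # any visit order works; heuristics help
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj) + 1) if c not in used)
    return color

# Vertices of the same color share no edge, so under edge consistency
# each color class can be updated in parallel, one color at a time.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(greedy_color(adj))  # {'a': 0, 'b': 1, 'c': 2, 'd': 1}
```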
## My thoughts

### Data staleness vs. networking latency
Reduce to a parameter server. Use a hash function to guarantee M concurrent data transmissions between clients and the parameter server; the other clients use their local stale parameters.

### Use rateless codes (e.g. spinal coding) to achieve load balancing or congestion control.

## GPU shortcomings

1. Small GPU memory.

2. Expensive memory copies between GPU memory and CPU memory.

3. Data parallelism has benefits, but network communication overhead quickly limits scalability.


## Why do we need a parameter server?

https://www.quora.com/What-is-the-Parameter-Server

In a nutshell, it is best thought of as a shared blackboard.



## My thoughts, 04/12/2017

SplitStream tree depth == stale-data clock bound.

Why not P2P over workers: small-message transmission is evil, and it is very difficult to keep data consistent.

Erasure codes: (m+k) data splits; pick the fastest top m of them. Mitigates the staleness of processing.

Fine-scale parallelism.


## Rhythm

The key insight is that, given an incoming stream of requests, a server could delay some requests in order to align the execution of similar requests, allowing them to execute concurrently.


# How to use a parameter server
http://pmls.readthedocs.io/en/latest/installation.html


### Find more application scenarios.


## Hemingway

Mini-batch SGD (batch size b) introduces training error on the order of √b compared with sequential one-by-one data processing.

### Main idea

Window-based prediction of the loss function: given information from the first 50 iterations, predict iteration 51 (or iterations 51-60). When combined with Ernest (by Shivaram), it can map the iteration count to time.


# TPU introduction
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

## Key concepts
1. Systolic array: hold intermediate data inside the ALUs and pass it among ALUs without writing it back to registers or memory. The data-processing flow is more like Ray.

2. The TPU has a low clock frequency but highly parallel computation.

| Processor | Operations per cycle |
| --- | --- |
| CPU | a few |
| CPU (vector extension) | tens |
| GPU | tens of thousands |
| TPU | hundreds of thousands, up to 128K |



## Nvidia

1. GPUDirect RDMA

2. NVLink

3. MXNet

# Confidence is the key to success.
--------------------------------------------------------------------------------