├── .gitignore
├── EuroSys2020
│   ├── AlloX.md
│   ├── Mem-disaggragate.md
│   ├── delegation-sketch.md
│   └── figs
│       └── bipartite-graph.png
├── IdeaPhilosophy.md
├── LICENSE
├── NSDI2017
│   ├── 2layer.png
│   ├── APUNet.md
│   ├── ExCamera.md
│   ├── Gaia.md
│   ├── Infiniswap.md
│   ├── LetItFlow.md
│   ├── Pytheas.md
│   ├── Tux.md
│   └── vCorfu.md
├── OSDI2014
│   ├── Arrakis.md
│   ├── Barrelfish.md
│   ├── IX.md
│   ├── Jitk.md
│   ├── willow.md
│   └── xxx.png
├── OSDI2016
│   ├── CloudSys1
│   │   ├── Firmament(quincy).md
│   │   ├── Morpheus.md
│   │   ├── altruistic.md
│   │   ├── graphene.md
│   │   ├── graphene_eg.png
│   │   └── note.md
│   ├── FaultTolerance Consensus
│   │   ├── 1rtt.png
│   │   ├── 2rtt.png
│   │   ├── Janus.md
│   │   ├── NOPaxos.md
│   │   ├── Olive.md
│   │   ├── XFT.md
│   │   ├── multicast.png
│   │   └── paxos.png
│   ├── Networking
│   │   ├── ERA.md
│   │   ├── NetBricks.md
│   │   ├── PathDump.md
│   │   └── disaggregateDatacenter.md
│   ├── OS1
│   │   ├── LWC.md
│   │   ├── Ratchet.md
│   │   ├── Smelt.md
│   │   ├── Yggdrasil.md
│   │   ├── latency-matrix.png
│   │   ├── multicore-message.png
│   │   └── tree-gen.png
│   ├── OS2
│   │   ├── CertiKOS.md
│   │   ├── EbbRT.md
│   │   ├── Ingens.md
│   │   ├── SCONE.md
│   │   └── ingen.png
│   ├── Potpourri
│   │   ├── Clarinet.md
│   │   ├── EC-Cache.md
│   │   ├── JetStream.md
│   │   ├── To Waffinity and Beyond.md
│   │   ├── local.png
│   │   └── local2.png
│   ├── Transaction_Storage
│   │   ├── Fasst.md
│   │   ├── ICG.md
│   │   ├── Pace.md
│   │   ├── latency.png
│   │   └── snow.md
│   └── cloud system2
│       ├── DQBarge.md
│       ├── diamond.md
│       ├── diamond.png
│       ├── history.md
│       ├── history.png
│       └── slicer.md
├── README.md
├── SOSP2005
│   └── speculator.md
└── gpu.md

/.gitignore:
--------------------------------------------------------------------------------
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2

## Intermediate documents:
*.dvi
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf

## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.brf
*.run.xml

## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync

## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa

# achemso
acs-*.bib

# amsthm
*.thm

# beamer
*.nav
*.snm
*.vrb

# cprotect
*.cpt

# fixme
*.lox

#(r)(e)ledmac/(r)(e)ledpar
*.end
*.?end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R

# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
*.glsdefs

# gnuplottex
*-gnuplottex-*

# hyperref
*.brf

# knitr
*-concordance.tex
# TODO Comment the next line if you want to keep your tikz graphics files
*.tikz
*-tikzDictionary

# listings
*.lol

# makeidx
*.idx
*.ilg
*.ind
*.ist

# minitoc
*.maf
*.mlf
*.mlt
*.mtc
*.mtc[0-9]
*.mtc[1-9][0-9]

# minted
_minted*
*.pyg

# morewrites
*.mw

# mylatexformat
*.fmt

# nomencl
*.nlo

# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd

# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/

# pdfcomment
*.upa
*.upb

# pythontex
*.pytxcode
pythontex-files-*/

# thmtools
*.loe

# TikZ & PGF
*.dpth
*.md5
*.auxlock

# todonotes
*.tdo

# xindy
*.xdy

# xypic precompiled matrices
*.xyc

# endfloat
*.ttt
*.fff

# Latexian
TSWLatexianTemp*

## Editors:
# WinEdt
*.bak
*.sav

# Texpad
.texpadtmp

# Kile
*.backup

# KBibTeX
*~[0-9]*

--------------------------------------------------------------------------------
/EuroSys2020/AlloX.md:
--------------------------------------------------------------------------------
# AlloX: Compute Allocation in Hybrid Clusters

Optimization: min-cost bipartite matching.

![](./figs/bipartite-graph.png)

Left: jobs. Right: compute resources (GPU, CPU, etc.)

--------------------------------------------------------------------------------
/EuroSys2020/Mem-disaggragate.md:
--------------------------------------------------------------------------------
# Can Far Memory Improve Job Throughput?

Interesting paper; it tries to use RDMA to achieve fast and usable memory disaggregation, evaluated in simulations.

Highlights:

* Builds two RDMA fetch queues: one for critical-path fetches, one for prefetches.

* Offloads memory reclamation to a dedicated CPU, taking it off the critical path; keeps an [alpha] buffer for warming up memory reclamation.

--------------------------------------------------------------------------------
/EuroSys2020/delegation-sketch.md:
--------------------------------------------------------------------------------
# Delegation Sketch: a Parallel Design with Support for Fast and Accurate Concurrent Operations

The count-min sketch in Sec 2.1 seems really useful.

Did not go through the whole paper. The problem to solve is achieving efficient scaling when concurrent queries and inserts occur together.

--------------------------------------------------------------------------------
/EuroSys2020/figs/bipartite-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/EuroSys2020/figs/bipartite-graph.png

--------------------------------------------------------------------------------
/IdeaPhilosophy.md:
--------------------------------------------------------------------------------
# Philosophy

When one designs a model or a system, it contains several assumptions or design choices that are not good enough.
We can step backward and revisit them with more freedom and fewer restrictions.

Design systems without a specific model.

Give designs more dimensions of freedom.

We can leverage more attributes or characteristics of users' applications.

Machine learning / deep learning can only do/simulate what people can do.

It is questionable whether machine learning could predict something that people cannot, like forecasting the stock market.

### On-heap vs. off-heap in the JVM
https://dzone.com/articles/heap-vs-heap-memory-usage

Detach the control plane from the data plane (e.g. Drizzle by Shivaram).



## Confidence is the key to success.
Everyone works differently. Figure out how you like to work and what makes you most productive.

(Do you work best in the early mornings or late evening? Do you like working with others or do you prefer to work by yourself? Do you work best when you have multiple projects or just one?)

It's a good idea to discuss these preferences with your advisor so they understand you better and can work with you as effectively as possible.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity.
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

--------------------------------------------------------------------------------
/NSDI2017/2layer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/NSDI2017/2layer.png

--------------------------------------------------------------------------------
/NSDI2017/APUNet.md:
--------------------------------------------------------------------------------
# APUNet: Revitalizing GPU as Packet Processing Accelerator

The APU reduces memory copies between GPU and CPU (because the GPU is integrated with the CPU).

1. Reduce redundant GPU launch and tear-down overhead => GPU threads run persistently. Data synchronization => GPU thread groups: one thread in the group processes and puts data into the L2 cache, and the thread group uses dummy memory I/O to force the GPU L2 cache to push data to main memory.

2. Zero-copy: traditionally, data is copied (NIC and CPU using mmap) into GPU memory. With mmap into shared memory there is still a copy; they instead receive data directly and keep it in shared memory.

BTW, it is the APU that removes the memory copy overhead.

--------------------------------------------------------------------------------
/NSDI2017/ExCamera.md:
--------------------------------------------------------------------------------
# Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads

## AWS shortcomings:

Starting nodes is costly if you run 4,000 threads for only 1 second (you pay for 4,000 threads (400 nodes) at an hourly rate).

Slow start time.

## AWS Lambda

AWS Lambda can launch 4,000 threads within a second, and costs less money.

--------------------------------------------------------------------------------
/NSDI2017/Gaia.md:
--------------------------------------------------------------------------------
# Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds


## Problem

--------------------------------------------------------------------------------
/NSDI2017/Infiniswap.md:
--------------------------------------------------------------------------------
# Efficient Memory Disaggregation with Infiniswap

## How to reduce remote memory allocation metadata

Use slabs instead of paging, which is a coarser granularity.


## How to allocate remote memory in a distributed way

Power of two choices.

## How to do memory eviction

Select x out of x+r slabs.

--------------------------------------------------------------------------------
/NSDI2017/LetItFlow.md:
--------------------------------------------------------------------------------
# Let It Flow: Resilient Asymmetric Load Balancing with Flowlet Switching

## Problem

Asymmetric network conditions: channels A and B have dynamic capacity.

How to adjust the transmission rate on each channel dynamically?

## Advantage

No feedback information needed. Simple.

Based only on estimated RTT, flowlets are transmitted with different inter-flowlet time gaps. The longer the RTT, the longer the gap => fewer packet transmissions.

--------------------------------------------------------------------------------
/NSDI2017/Pytheas.md:
--------------------------------------------------------------------------------
# Pytheas: Enabling Data-Driven Quality of Experience Optimization Using Group-Based Exploration-Exploitation

![](2layer.png)

Uses frontend servers to make decisions locally (fast), and backend servers to make decisions globally (with latency).

--------------------------------------------------------------------------------
/NSDI2017/Tux.md:
--------------------------------------------------------------------------------
# Tux: Distributed Graph Computation for Machine Learning

1. SSP (stale synchronous parallel) processing.

2. Heterogeneous objects (i.e. vertices) for data compression & less network traffic...

It is a good insight. For customer recommendation, items are high-degree vertices, whereas users are low-degree vertices. Therefore, scanning items reduces random memory access.

3. Mini-batches in graph modeling.

--------------------------------------------------------------------------------
/NSDI2017/vCorfu.md:
--------------------------------------------------------------------------------
# vCorfu: A Cloud-Scale Object Store on a Shared Log

## Shared log shortcomings:

1. Clients can't read the latest data (because the log only stores incremental updates) => playback is the bottleneck.

2.


## vCorfu

Builds a two-layer architecture of streams under a global log.

The vCorfu sequencer tracks not only the tail of the global log, but also the tail of each stream.

A client gets two addresses: 1) a global address, 2) a stream address.

The client writes to the log replica using the global address, and writes to the stream replica using the stream ID and stream address.

### Overhead

More commit logging (since we need to write not only to the global log, but also to the stream log), but it pays off.

### Benefit

Now we can access the log on only one replica (the stream replica), instead of the multiple replicas belonging to the global log servers.

--------------------------------------------------------------------------------
/OSDI2014/Arrakis.md:
--------------------------------------------------------------------------------
Arrakis: The Operating System is the Control Plane

I/O hardware is fast enough.

I/O is slow because of OS overhead, e.g. multiplexing/de-multiplexing, isolation, I/O scheduling.

Arrakis skips the kernel and lets applications get data directly from the NIC.

![](xxx.png)

Notes:

Storage plane: persistent data structures; operations are immediately persistent on disk

=> no need for serialization.

--------------------------------------------------------------------------------
/OSDI2014/Barrelfish.md:
--------------------------------------------------------------------------------
# Decoupling Cores, Kernels, and Operating Systems

Traditionally, many cores share ONE kernel. That one kernel controls and schedules applications across the CPUs.

Now, decouple the OS/kernel to achieve one OS/kernel per CPU.
This allows the system to plug in a new CPU or remove a CPU without interfering with the other CPUs.

--------------------------------------------------------------------------------
/OSDI2014/IX.md:
--------------------------------------------------------------------------------
# IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Contribution 1:

Separation of the data plane and the control plane.


Contribution 2:

Execution pipeline between TX and RX.

Adaptive batching:

1. Single packet => no batching, just process the packet.

2. Packet queue build-up => maximum batching, to reduce system overhead.

--------------------------------------------------------------------------------
/OSDI2014/Jitk.md:
--------------------------------------------------------------------------------
# Jitk: A Trustworthy In-Kernel Interpreter Infrastructure
Runs untrusted user code in the kernel, with theorem proving.

Uses BPF (Berkeley Packet Filter) to implement an interpreter on x86.

JIT: translates BPF to x86 for in-kernel execution.

--------------------------------------------------------------------------------
/OSDI2014/willow.md:
--------------------------------------------------------------------------------
# Willow: A User-Programmable SSD

People specialize SSDs for different usages.

The authors want to provide a programmable interface between users and the SSD.

--------------------------------------------------------------------------------
/OSDI2014/xxx.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2014/xxx.png

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/Firmament(quincy).md:
--------------------------------------------------------------------------------
# Firmament: fast, centralized cluster scheduling at scale

## Problem
Centralized schedulers: high-quality task placement, but latency of seconds or minutes.

Distributed schedulers: high throughput, low latency, but poor placements.

Hybrid schedulers: centralized placement for long-running tasks, distributed placement for short tasks.

## What Firmament can do
* Centralized scheduler
* High placement quality (like Quincy)
* Sub-second task placement latency
* Copes with demanding situations (e.g. oversubscription)
## How they did it
* Relaxation [Relaxation Methods for Minimum Cost Ordinary and Generalized Network Flow Problems] of Quincy

1. Terminating min-cost max-flow (MCMF) early to find approximate solutions => works badly
2. Incremental optimization => acceptable
3. Problem-specific heuristics help

## Background
* Task-by-task placement
drawbacks: 1. committing one placement early restricts the choices for tasks still waiting; 2. limited opportunity to amortize work

* Batched placement (Quincy)
advantage: jointly considers task placements (optimal)

## Key finding
Even though relaxation has the highest computational complexity, it actually performs best, with the least latency.
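
## Toy example: placement as min-cost flow

To make the flow-based placement concrete, here is a minimal sketch of one Quincy-style scheduling round expressed as a min-cost flow problem, using networkx as the solver. The task/machine names and arc costs are made-up assumptions; Firmament's real flow network (unscheduled aggregators, rack nodes, etc.) is far richer.

```python
# Toy Quincy-style scheduling as min-cost flow (illustrative only).
# Each task supplies one unit of flow; machines drain it to a sink.
# Arc costs stand in for placement preference (e.g. data locality).
import networkx as nx

G = nx.DiGraph()
tasks = {"t0": {"m0": 2, "m1": 9}, "t1": {"m0": 8, "m1": 1}}
machines = ["m0", "m1"]

for t in tasks:
    G.add_node(t, demand=-1)            # each task supplies 1 unit of flow
G.add_node("sink", demand=len(tasks))   # the sink absorbs all placements
for t, prefs in tasks.items():
    for m, cost in prefs.items():
        G.add_edge(t, m, capacity=1, weight=cost)  # placement arcs
for m in machines:
    G.add_edge(m, "sink", capacity=1, weight=0)    # machine capacity = 1 task

flow = nx.min_cost_flow(G)  # one MCMF solve = one scheduling round
placement = {t: m for t in tasks for m, f in flow[t].items() if f == 1}
print(placement)  # e.g. {'t0': 'm0', 't1': 'm1'}
```

Rerunning the solve as tasks arrive or finish is exactly where incremental optimization pays off: most of the flow from the previous round stays valid.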

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/Morpheus.md:
--------------------------------------------------------------------------------
# Morpheus: Towards Automated SLOs for Enterprise Clusters

## Problem
Tries to make job performance more predictable, especially for periodic tasks (while achieving high cluster utilization efficiency; I don't buy it).
## Handles two causes of unpredictable performance
1. `Sharing-induced` performance variability, caused by inconsistent allocations of resources across job runs.

2. `Inherent` performance variability, caused by differences in source code, data size, etc.


## How they did it
1. Automatic inference: inferring SLOs and modeling job resource demands. How: derive the target SLO by analyzing historical execution traces.

2. Recurring reservation: reserve resources based on the derived SLO (handles `sharing-induced` variance).

3. Dynamic reprovisioning: monitor resource allocation. If progress is slower/faster than expected, the scheduler can automatically adjust the reservation (handles `inherent` variance).

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/altruistic.md:
--------------------------------------------------------------------------------
# Altruistic Scheduling in Multi-Resource Clusters
## Problem
Schedulers always prefer instantaneous fairness. However, instantaneous fairness does not result in noticeable long-term benefits.
## What they want
Ensure performance isolation, while achieving good JCT and cluster utilization.

*Altruistic*
`leftover`: the amount of resources that a job cannot fully utilize. The job offers it out.

Intra-job: since they have a DAG for each job, compute how much leftover each job can provide.
Inter-job: using the leftover, 1) schedule tasks closest to completion, 2) pack the remainder for cluster efficiency.


## What they need to prove
Altruism will not inflate JCT.

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/graphene.md:
--------------------------------------------------------------------------------
# Graphene: Packing and Dependency-aware Scheduling for Data-Parallel Clusters

## What Graphene does
1. After generating the job DAG, it identifies the troublesome tasks and groups the remaining, non-troublesome tasks.
2. The scheduler places the troublesome tasks first.

## Different from existing schedulers:
Existing: a task is scheduled after all its parents finish.

Graphene: identify troublesome tasks first and place them, then place the other tasks around the trouble.

## Example of different scheduling
Here are good examples showing that Graphene can perform better than existing scheduling schemes in both online and offline scenarios.

![](graphene_eg.png)

## How to use Graphene with multiple DAGs
1. Convert the offline schedule into a priority order on tasks.

2. For online scheduling, enforce the schedule priority along with heuristics (e.g. fairness, JCT, packing efficiency).

## Nugget
1. Offline: troublesome tasks first (for each DAG).

2. Online: enforce priority over tasks along with other heuristics, as in the sketch below.
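
## Toy sketch

A minimal sketch of the troublesome-first idea. The "troublesome" test below (long duration or high resource demand) is my own stand-in; Graphene's actual criteria and packing are more involved.

```python
# Toy "troublesome tasks first" ordering for one DAG (illustrative only).
# Each task: duration, memory demand, and parent tasks.
tasks = {
    "a": {"dur": 10, "mem": 0.8, "parents": []},
    "b": {"dur": 1,  "mem": 0.1, "parents": ["a"]},
    "c": {"dur": 9,  "mem": 0.2, "parents": ["a"]},
    "d": {"dur": 1,  "mem": 0.1, "parents": ["b", "c"]},
}

def troublesome(t):
    # Stand-in test: long-running or resource-hungry tasks cause fragmentation.
    return tasks[t]["dur"] >= 8 or tasks[t]["mem"] >= 0.5

# Offline: lay out troublesome tasks first, then fit the rest around them.
trouble = [t for t in tasks if troublesome(t)]
rest = [t for t in tasks if not troublesome(t)]
offline_order = trouble + rest

# Online: the offline order becomes a priority; dependencies still gate execution.
priority = {t: i for i, t in enumerate(offline_order)}

done, schedule = set(), []
while len(done) < len(tasks):
    ready = [t for t in tasks if t not in done
             and all(p in done for p in tasks[t]["parents"])]
    nxt = min(ready, key=priority.get)   # highest priority among ready tasks
    schedule.append(nxt)
    done.add(nxt)
print(schedule)  # ['a', 'c', 'b', 'd']: the long task 'c' jumps ahead of 'b'
```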

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/graphene_eg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/CloudSys1/graphene_eg.png

--------------------------------------------------------------------------------
/OSDI2016/CloudSys1/note.md:
--------------------------------------------------------------------------------
When one designs a model or a system, it contains several assumptions or design choices that are not good enough.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/1rtt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/1rtt.png

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/2rtt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/2rtt.png

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/Janus.md:
--------------------------------------------------------------------------------
# Janus: Consolidating Concurrency Control and Consensus for Commits under Conflicts

## Problem
Conventional fault-tolerant distributed transactions have a *concurrency control* layer on top of a Paxos *consensus protocol*.

Therefore, we need two rounds of cross-datacenter coordination: one for concurrency control, one for consensus. That means >= 2 RTTs.

## Janus

One RTT of coordination for both concurrency control & consensus. In addition, under contention between different transactions, Janus copes well by ensuring a deterministic execution re-ordering.

![](1rtt.png)

When contention on the same transaction happens, Janus ensures the commit is never aborted, though it may need 2 RTTs to replicate the transaction. This is shown in the following picture.
![](2rtt.png)


## Key takeaway
We should consider separating concerns that one mechanism handles together. TCP uses slow start and AIMD for both congestion control and fairness; Dina Katabi separated congestion control from fairness control (in XCP) to get better system performance.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/NOPaxos.md:
--------------------------------------------------------------------------------
# Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering

Paxos's overhead is the sequential copying of data (e.g. first to the leader replica, then the second replica, then the third). In an asynchronous network, this sequential copying introduces a long delay, as illustrated in the figure below.

![](paxos.png)

Therefore, to mitigate this replica-copying delay, the authors propose using *multicast* for fast `in-order` data transmission, while letting the hosts ensure data *reliability*, as shown in the picture below.

![](multicast.png)

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/Olive.md:
--------------------------------------------------------------------------------
# Realizing the fault-tolerance promise of cloud storage using locks with intent

## Problem
In cloud computation over a distributed file system:

Failures:
* The application can fail => the network reorders/drops messages to the underlying storage.

* Lower-level API => the current storage operation can fail.

We need data consistency over concurrent operations on cloud storage state, plus failures of the VMs running applications. The basic idea would be to do replication with Paxos, in order to achieve data consistency across VMs.

However, that wastes storage. They instead use locks at the storage layer to ensure fault tolerance.

## Solution
`Locks with intent` combine computation with storage operations, using locks.

`Intent` code contains cloud storage operations and local computation. It ensures that when an intent completes, each step in the intent executed *exactly once*.

`Lock` gives an `intent` exclusive access to the object, and unlocks when the client holding the lock crashes.

*Exactly once*: store intents with IDs. They introduce DAAL (distributed atomic affinity logging), which colocates the log entry for executing an intent step with the object changed by that step.

*Concurrent executions of the same intent* => an `intent collector` periodically completes unfinished intents.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/XFT.md:
--------------------------------------------------------------------------------
# XFT: Practical Fault Tolerance Beyond Crashes

## Problem
Current fault-tolerance schemes are CFT (crash fault-tolerant). Therefore, they cannot deal with non-crash faults, like malicious behavior.

BFT (Byzantine fault tolerance) can deal with non-crash faults. However, its overhead is too high compared with CFT.

## XFT (cross fault tolerance)

Does not need BFT's assumption that adversaries are that powerful.

Reduced latency: (out of 2t+1 total replicas) XPaxos synchronously replicates client requests to only the t+1 `active` replicas; the other `passive` replicas use a *lazy replication approach*.

Difference from CFT or BFT: generate views using all t+1 active replicas instead of just the primary replica, ensuring data consistency.
FD scheme (fault detection, like BFT): a non-crash faulty replica is not allowed to transfer its latest state to a correct replica in the new sync group. This eliminates data poisoning from the non-crash faulty replica.

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/multicast.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/multicast.png

--------------------------------------------------------------------------------
/OSDI2016/FaultTolerance Consensus/paxos.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/FaultTolerance Consensus/paxos.png

--------------------------------------------------------------------------------
/OSDI2016/Networking/ERA.md:
--------------------------------------------------------------------------------
# Efficient Network Reachability Analysis using a Succinct Control Plane Representation

## Problem

It is not easy to do network verification (e.g. whether A can talk to B or not).

Reason: arrival of new route announcements, link failures, etc.



## Previous work on reasoning about the control plane for network verification
Bagpipe, rcc, and ARC focus on a single routing protocol or a limited set of routing protocol features.

Batfish analyzes the entire control plane. But it does so by simulating the behavior of each routing protocol to compute the data plane, which is expensive.

## ERA
Efficient Reachability Analysis.

Reasons directly about the network control plane (no simulation of the data plane), and can scale.

1. Analyzes the control plane by modeling what each router learns from its neighbors.
2. Picks the best route when multiple routes to the same prefix are learned.

Makes a trade-off between expressiveness and abstraction level.

* A unified, protocol-invariant routing abstraction.

* A compact binary decision diagram encoding of the routers' control plane.

--------------------------------------------------------------------------------
/OSDI2016/Networking/NetBricks.md:
--------------------------------------------------------------------------------
# NetBricks: Taking the V out of NFV
## Problem
NFV (network function virtualization) is the new trend, replacing hardware-based middleboxes.

However, NFV introduces a lot of system overhead for isolation (e.g. VMs, IPC), which decreases performance (throughput, latency).

Current solution: BESS + containers, which is slow.

## Solution
NetBricks: single process, low cost.

Zero-copy soft isolation (ZCSI): uses unique types to eliminate packet copies, while preserving packet isolation.

Maximum performance comes from placing an entire NF chain on a single core.

--------------------------------------------------------------------------------
/OSDI2016/Networking/PathDump.md:
--------------------------------------------------------------------------------
# Simplifying Datacenter Network Debugging with PathDump
## Problem
Network debugging is complicated and needs to take all the components into consideration.

### Problems with existing works

HSA [NSDI'12], Anteater [SIGCOMM'11]: analysis based on static network snapshots => hard to get consistent network states.

NetSight [NSDI'14]: per-switch logging => high bandwidth and processing overhead.

Everflow [SIGCOMM'15], Planck [SIGCOMM'14]: selective packet sampling and mirroring => identifying which packets to sample for debugging problems is complex.

Pathquery [NSDI'16]: SQL-like queries on switches => needs to dynamically install switch rules.



## Nugget

Divide and conquer.

## How PathDump works

Tracing packet trajectories: embed the switch ID along the packet trajectory (using sampling to reduce the header space used).

On the end host: get the switch IDs from the packet header and record them in local storage. This storage is used for network debugging queries.

--------------------------------------------------------------------------------
/OSDI2016/Networking/disaggregateDatacenter.md:
--------------------------------------------------------------------------------
# Network Requirements for Resource Disaggregation
## Problem

Datacenter speed increases have slowed down. Big companies are trying the new disaggregated datacenter (DDC).

## Key findings for DDC
1. Network bandwidth in the range of 40-100 Gbps is sufficient to maintain application-level performance.
2. Network latency in the range of 3-5 us is needed to maintain application-level performance.
3. The bottleneck, or the root of the bandwidth & latency requirements, is the applications' memory bandwidth demand.
4. Disaggregation at datacenter scale is feasible.
5. TCP/DCTCP fails to meet the target requirements.

--------------------------------------------------------------------------------
/OSDI2016/OS1/LWC.md:
--------------------------------------------------------------------------------
# Light-Weight Contexts: An OS Abstraction for Safety and Performance
## Problem
Switching & communication between processes cost a lot (e.g. kernel scheduler, IPC, context switching).

## Nugget
Isolation within a process.

## What they do
Decouple memory and privileges from threads, and reuse threads.

*Light-weight context* (lwC): within a process, maintains isolation of virtual memory mappings, file descriptor bindings, etc.

lwCs are orthogonal to threads: a thread may start in lwC a, then switch to lwC b.

## Benefit
lwCs are not associated with threads. We can run untrusted code in one.

--------------------------------------------------------------------------------
/OSDI2016/OS1/Ratchet.md:
--------------------------------------------------------------------------------
# Intermittent Computation without Hardware Support or Programmer Intervention
## Problem
Intermittent computation => write-after-read (WAR) hazards.

## Solution
* Identify WARs
* Checkpointing
* Optimization => reduce the number of checkpoints; avoid WAR violations between caller and callee code.

## Strength
* Correcting WARs (caused by intermittence) without hardware modification or burdening the programmer.

## Problem
* How to handle checkpointing failure => duplicated checkpoint writes

(say they have two databases for writing the checkpointing results), ensuring atomic writes.

--------------------------------------------------------------------------------
/OSDI2016/OS1/Smelt.md:
--------------------------------------------------------------------------------
# Machine-aware Atomic Broadcast Trees for Multicores
## Background
### `SMP`: Symmetric Multi-Processor

All the cores are equal, and share all resources.
If multiple cores issue requests to RAM => access is serialized.

### `NUMA`: Non-Uniform Memory Access

SMP => limited scalability (CPUs :arrow_up: => memory-access conflicts :arrow_up:).

NUMA: each group of CPUs has shared local memory. All groups are connected by a crossbar.

Shortcoming: the latency of remote memory access is much longer than that of local memory access.

### `MPP`: Massively Parallel Processing
A group of SMPs.

Only local memory access within an SMP; no remote memory access among SMPs.

## Difference in messaging among nodes and cores
![](multicore-message.png)
## Problem
Generating a good tree structure for multi-core broadcasting is important.

Multi-core hardware is complex => hard to generate one tree that fits all machines.

## How to generate the tree
* Gather messaging latencies among cores
![](latency-matrix.png)

* Tree-generating heuristics
![](tree-gen.png)

--------------------------------------------------------------------------------
/OSDI2016/OS1/Yggdrasil.md:
--------------------------------------------------------------------------------
# Push-Button Verification of File Systems via Crash Refinement

## Problem

File system verification is hard to prove (it takes years of manual effort).

=> Yggdrasil produces a counterexample if there is a bug. No proof needed.

## How it works
`Crash refinement`: captures non-determinism bugs (e.g. reordering of writes).

`Crash refinement` is achieved with a *satisfiability modulo theories* (SMT) solver (Z3).

Specification => SMT solver.

To make the SMT solution scale => *layering*:

exhaust all execution paths within a layer, avoiding path explosion between layers.

## How to use it
Input: `specification`, `implementation`, `consistency invariants`.

Yggdrasil verifies it.

If (pass): compile.

else (fail): visualizer => counterexample.

## Key contributions

The basic idea is to check whether specification F_0 is equivalent to implementation F_1 (any-to-any mapping).

This is too strong (it does not account for non-determinism like reorderings and crashes).

`Crash refinement`: F_1 to F_0 (any-to-some mapping).
## Strengths
* Counterexample-based debugging is better than proofs.
* The specifications are succinct, expressive, and free of implementation details => reusable.
* Generalizes to applications that use the disk in other ways (e.g.
copy).

--------------------------------------------------------------------------------
/OSDI2016/OS1/latency-matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS1/latency-matrix.png

--------------------------------------------------------------------------------
/OSDI2016/OS1/multicore-message.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS1/multicore-message.png

--------------------------------------------------------------------------------
/OSDI2016/OS1/tree-gen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS1/tree-gen.png

--------------------------------------------------------------------------------
/OSDI2016/OS2/CertiKOS.md:
--------------------------------------------------------------------------------
# CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels

## Problem
There is no way of verifying the correctness of concurrent programs in an OS.

## Difficulties in concurrent OS kernel verification
* Interleaved executions are interdependent and hard to untangle.
* User, I/O, and multicore components work coherently with each other.
* Concurrent kernels do not guarantee that a concurrent system call will return.
* It should be easy to reuse.

## CertiKOS
Using existing certification methods, build certified abstraction layers, like L1, L2, etc.

To support concurrency, they parameterize each layer L with an active thread A, and denote EC(L,A) as its environment context.

For each EC(L,A) they use an "e" to capture a specific instance (e.g. one CPU).

In a nutshell, they use existing certification methods and design a layered structure; for each layer, an "e" unit describes the fine-grained environment context of one CPU or thread.

--------------------------------------------------------------------------------
/OSDI2016/OS2/EbbRT.md:
--------------------------------------------------------------------------------
# EbbRT: A Framework for Building Per-Application Library Operating Systems

## Problem:
We do not always need the OS to provide protection and isolation. Reason: a general-purpose operating system has overhead that hurts per-application performance.

## EbbRT
It is a set of components, Ebbs, that developers can use to assemble an application.

Pretty much like the SDN layers.

* Low-level event-driven execution environment
* Heterogeneous distributed system architecture
* Elastic building blocks

They show an example of how EbbRT can be more efficient than memcached on Linux.

--------------------------------------------------------------------------------
/OSDI2016/OS2/Ingens.md:
--------------------------------------------------------------------------------
# Coordinated and Efficient Huge Page Management with Ingens
## Problem

High address translation cost.

## Traditional solution:
Huge pages improve TLB coverage.

However, they bring high page-fault latency, memory bloating, etc.
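
A back-of-the-envelope illustration of why huge pages help TLB coverage; the 1536-entry TLB size below is an assumed example, not a number from the paper:

```python
# Rough TLB-coverage arithmetic (illustrative; 1536 entries is an assumed
# L2 TLB size, not a figure from the Ingens paper).
tlb_entries = 1536
base_page = 4 * 1024          # 4 KB base pages
huge_page = 2 * 1024 * 1024   # 2 MB huge pages (x86-64)

print(tlb_entries * base_page // 2**20, "MB covered with base pages")   # 6 MB
print(tlb_entries * huge_page // 2**30, "GB covered with huge pages")   # 3 GB
```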

## Ingens
![](ingen.png)

* Page-fault latency

Ingens uses asynchronous allocation: the page-fault handler only assigns a base page, and huge-page allocation happens in the background.

* Memory bloating

Ingens monitors the spatial utilization of each huge-page region, and enables utilization-based allocation.

## Design philosophy we can learn
* Reduce latency => push high-latency work into the background.
* Find more space => find the allocation unit and check whether it is fully used. If not, try to fully utilize it.

--------------------------------------------------------------------------------
/OSDI2016/OS2/SCONE.md:
--------------------------------------------------------------------------------
# SCONE: Secure Linux Container Environments with Intel SGX

## Problem
The cloud provider and its users do not trust each other. Previous works on cloud security are heavyweight and introduce high overhead.

## Design
* Enhanced C library => small TCB
* Asynchronous system calls & user-space threading => reduce the number of enclave exits
* Network & file system shields => actively protect user data

--------------------------------------------------------------------------------
/OSDI2016/OS2/ingen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/OS2/ingen.png

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/Clarinet.md:
--------------------------------------------------------------------------------
# Clarinet: WAN-Aware Optimization for Analytics Queries

## Problem:

In geo-distributed datacenters, it is hard to run analytics queries spanning all of a company's DCs around the world.

## Solution:

The key latency, they claim, is the network latency in the WAN. They leverage a concept from SDN: a logical-to-physical plan abstraction, which makes it easy to implement different policies in the logical-layer query optimizer.


Therefore, they do network-aware query placement and scheduling. For example, they try to first schedule queries that run between two DCs with higher network throughput, and schedule later the queries whose DC pairs share low bandwidth.

For iterative jobs, people previously used a shortest-job-first (SJF) scheduling algorithm to achieve low latency. The authors find that some links are under-utilized with SJF. Therefore, they enable more parallel data communication in the network to achieve overall low latency among all the DCs: identify the k shortest jobs instead of just 1, and schedule them in parallel.

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/EC-Cache.md:
--------------------------------------------------------------------------------
# EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding

Uses erasure coding, something like the spinal codes in the SIGCOMM paper, to cache `partial` data replicas in memory.
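
To make the idea concrete, here is a toy sketch of reading under erasure coding. EC-Cache itself uses proper (k, r) codes (Reed-Solomon style) and reads any k of k+r splits in parallel; this uses a single XOR parity split with k=2 purely for illustration:

```python
# Toy (k=2, r=1) erasure-coded cache read (illustrative only; EC-Cache
# uses real Reed-Solomon-style codes, and padding handling is omitted here).
def encode(value: bytes, k: int = 2):
    value += b"\x00" * (-len(value) % k)            # pad to a multiple of k
    size = len(value) // k
    splits = [value[i*size:(i+1)*size] for i in range(k)]
    parity = bytes(a ^ b for a, b in zip(*splits))  # XOR parity split
    return splits + [parity]

def decode(splits):
    # Any 2 of the 3 splits suffice; rebuild a missing data split via XOR.
    if splits[0] is None:
        return bytes(a ^ b for a, b in zip(splits[2], splits[1])) + splits[1]
    if splits[1] is None:
        return splits[0] + bytes(a ^ b for a, b in zip(splits[2], splits[0]))
    return splits[0] + splits[1]

splits = encode(b"hotobject!")
splits[0] = None          # one cache server is slow or down
print(decode(splits))     # b'hotobject!': the read completes at full speed
```

The load-balancing benefit falls out of the same property: a hot object's reads spread across k+r servers instead of hammering one replica.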

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/JetStream.md:
--------------------------------------------------------------------------------
# JetStream: Cluster-scale Parallelization of Information Flow Queries

## Dynamic information flow tracking (DIFT)
DIFT tracks execution causality; it is also known as taint tracking.

## Problem
Parallelizing DIFT is hard, because of sequential dependencies.

## Solution:
Time-slice execution into epochs.
![](local.png)

Here is a specific example of how it works.

![](local2.png)


By splitting execution into epochs, we can parallelize DIFT like a parallel stream-processing pipeline.

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/To Waffinity and Beyond.md:
--------------------------------------------------------------------------------
# To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

`Classical waffinity`: unable to achieve incremental parallelization.

Therefore, we get `hierarchical waffinity`: a hierarchy of aggregate data volumes, instead of only one top-level aggregate.

However, sometimes we need to access two different file blocks, and the traditional scheme can only ensure we access one block.

Therefore, we need `hybrid waffinity` to combine partitioning with fine-grained locking: 1) particular blocks are protected with locking from multiple affinities; 2) it continues to allow incremental development.

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/local.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/Potpourri/local.png

--------------------------------------------------------------------------------
/OSDI2016/Potpourri/local2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/Potpourri/local2.png

--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/Fasst.md:
--------------------------------------------------------------------------------
# FaSST: Fast, Scalable and Simple Distributed Transactions with Two-sided (RDMA) Datagram RPCs

## Problem
Existing systems: one-sided RDMA. FaSST: two-sided RDMA.

For a hash-map read:

Two RTTs for data transmission: 1st, read the pointer; 2nd, read the value.

FaSST: one RTT for data transmission.

### Why one-sided RDMA is bad
Cannot use lock-free I/O.

One-sided RDMA needs a connection per pair (too many connections can be bad: NIC cache misses).

With one-sided connections, coordination is slower.

## Changes
RPC over RDMA. RPCs involve the remote CPU in message processing, rather than excluding it.

## Trade-off
FaSST is slower than 1 traditional RDMA read, but faster than 2 RDMA reads.
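
## Toy comparison

A minimal mock of the two designs' round-trip counts. Everything here is simulated in plain Python (no real RDMA verbs); the data layout and handler are made up for illustration:

```python
# Mock comparison: client-side pointer chasing (2 one-sided reads) vs.
# an RPC that runs the lookup on the server's CPU (1 round trip).
index = {"key1": 0x10}          # hash index: key -> "address" of the value
memory = {0x10: "value1"}       # server memory, addressed reads

rtts = 0

def one_sided_read(addr_space, addr):
    global rtts
    rtts += 1                   # each remote read costs one network RTT
    return addr_space[addr]

# One-sided design: the client chases the pointer itself => 2 RTTs.
rtts = 0
ptr = one_sided_read(index, "key1")
val = one_sided_read(memory, ptr)
print(val, "one-sided RTTs:", rtts)        # value1 one-sided RTTs: 2

# RPC design: the server CPU does the full lookup => 1 RTT.
def rpc_get(key):
    global rtts
    rtts += 1                   # one request/response round trip
    return memory[index[key]]   # pointer chase happens locally on the server

rtts = 0
print(rpc_get("key1"), "RPC RTTs:", rtts)  # value1 RPC RTTs: 1
```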
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/ICG.md:
--------------------------------------------------------------------------------
# Incremental Consistency Guarantees for Replicated Objects

## Problem
There is a dilemma between consistency and latency for geo-distributed data-replication queries.

If you want strong consistency, you get high latency & low availability.

If you want low latency, you get weak consistency & high availability.

Neither is good on its own, and a hybrid introduces difficulty for developers.

## What they do
![](latency.png)

Focus on the grey area: combining consistency models in a single operation.

The developer invokes an operation and gets multiple incremental views of the results.

The first view is the low-latency (weak-consistency) one, followed by higher-latency (better-consistency) views, and so on.

## Key takeaway

### Make discrete/binary things continuous.
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/Pace.md:
--------------------------------------------------------------------------------
# Correlated Crash Vulnerabilities
## Problem
Reliability is important in distributed systems.

Core mechanism: replication.

Replication can endure a single-machine crash.

The problem is correlated crashes (all data replicas crash at the same time) and recovering together.

## How they solve the problem
Determine cuts in the distributed execution.

Generate persistent states corresponding to those cuts.

Then prune the states to increase efficiency.

Recover based on the cuts.
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/latency.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/Transaction_Storage/latency.png
--------------------------------------------------------------------------------
/OSDI2016/Transaction_Storage/snow.md:
--------------------------------------------------------------------------------
# The SNOW Theorem and Latency-Optimal Read-Only Transactions
## Background
### Why sharding?

1. Sharding is horizontal partitioning of a database (splitting rows).
2. The partitions share the same table structure.

### Typical webpage read
When loading a webpage, multiple reads over many shards are issued (100s to 1000s) in parallel.

Some reads depend on earlier reads (e.g. read B uses the key returned by earlier read A).

"One-shot" read-only transactions have no key dependencies across shards.
## Problem
The SNOW theorem says there is no way to get both low latency and high consistency for read-only transactions.

## SNOW theorem
It is IMPOSSIBLE for a read-only transaction algorithm to provide all four desirable properties:
* Strict serializability
* Non-blocking operations
* One response from each shard
* Compatibility with conflicting Write transactions

## Key insight

SNOW is tight: any combination of 3 is possible.

To make reads as fast as possible, we need to shift as much coordination overhead as possible to writes.
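A toy sketch of this shift-work-to-writes idea (my illustration, not the actual protocols described next): writers install timestamped versions, so a read-only transaction is non-blocking and needs only one response per shard:

```python
class Shard:
    """Multiversioned shard: key -> list of (commit_ts, value)."""
    def __init__(self):
        self.versions = {}

    def write(self, key, value, commit_ts):
        # Writes pay the coordination cost: commit_ts is assumed to come
        # from some transaction-ordering protocol, not shown here.
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read_at(self, key, snapshot_ts):
        # Non-blocking, one response: newest version <= snapshot_ts.
        best = None
        for ts, value in self.versions.get(key, []):
            if ts <= snapshot_ts and (best is None or ts > best[0]):
                best = (ts, value)
        return best[1] if best else None

def read_only_txn(shards, keys, snapshot_ts):
    """One parallel, non-blocking request per involved shard."""
    return {k: shards[hash(k) % len(shards)].read_at(k, snapshot_ts)
            for k in keys}

shards = [Shard(), Shard()]
shards[hash("x") % 2].write("x", 1, commit_ts=10)
shards[hash("y") % 2].write("y", 2, commit_ts=12)
print(read_only_txn(shards, ["x", "y"], snapshot_ts=11))  # {'x': 1, 'y': None}
```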
## What they are doing
A slightly weaker consistency, `process-ordered` serializability, without breaking the other three properties.

`COPS-SNOW`: adds "O" to COPS (which only has "N").

`Rococo-SNOW`: adds "O" to Rococo (which has "S+W").
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/DQBarge.md:
--------------------------------------------------------------------------------
# DQBarge: Improving Data-Quality Tradeoffs in Large-Scale Internet Services

## Problem

* Time goal

Low-level components are unaware of the response-time goal and may miss it.

We want to make data-quality tradeoffs proactively. That is, the developer should not need to worry about the time boundary; the system can handle it systematically.

## DQBarge
1. Propagate critical information along the causal path of request processing.

2. In an offline stage, use the information gathered in step 1 to generate QoS models, and proactively determine which tradeoff to make.

3. Generate the QoS models using a small portion of the data.
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/diamond.md:
--------------------------------------------------------------------------------
# Diamond: Automating Data Management and Storage for Wide-area, Reactive Applications

ACID+R: R is reactivity.

## Diamond
Automated end-to-end data management and storage.
![](diamond.png)
* Uses a multi-layer cache
* Optimistic concurrency control for data
* Data push notifications
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/diamond.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/cloud system2/diamond.png
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/history.md:
--------------------------------------------------------------------------------
# History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters
![](history.png)

* Uses history to find the under-utilized parts of the cluster, and then uses them.

* Leverages data replicas to achieve data locality: uses a two-dimensional data-replica matrix to calculate the best replica positions.
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/history.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GuanhuaWang/SystemPaper/ddbaed59105b3be87b8bdb3087910307696288b1/OSDI2016/cloud system2/history.png
--------------------------------------------------------------------------------
/OSDI2016/cloud system2/slicer.md:
--------------------------------------------------------------------------------
# Slicer: Auto-Sharding for Datacenter Applications
## Problem
Sharding and load balancing are important for datacenter applications.

We could use local memory for caching data; however, datacenter apps are often not using such a cache.

## Slicer
Dynamically shards the workload, keeping state in RAM.

Uses a small portion of memory on each node to build a control plane. It has a centralized sharder on a central node and load-balances accordingly by transmitting messages across the Slicer control plane within a datacenter. A toy version of the assignment step is sketched below.
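A minimal sketch of centralized slice assignment (my greedy toy version; Slicer's real assigner also weighs reassignment churn, and the hashing details differ):

```python
import hashlib

NUM_SLICES = 64  # many more slices than servers, so load can be rebalanced

def slice_of(key: str) -> int:
    """Hash a key into one of the fixed slices."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SLICES

def assign_slices(slice_load: dict, servers: list) -> dict:
    """Greedy: place hottest slices first, each on the least-loaded server."""
    load = {s: 0.0 for s in servers}
    assignment = {}
    for sl in sorted(slice_load, key=slice_load.get, reverse=True):
        target = min(load, key=load.get)
        assignment[sl] = target
        load[target] += slice_load[sl]
    return assignment

print(slice_of("user:42"))  # which slice this key lives in
loads = {0: 9.0, 1: 1.0, 2: 5.0, 3: 4.0}
print(assign_slices(loads, ["serverA", "serverB"]))
# {0: 'serverA', 2: 'serverB', 3: 'serverB', 1: 'serverA'}
```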
## What we learn
Use a small portion of memory to cache important information.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SystemPaper


[![License](https://img.shields.io/badge/license-BSD-blue.svg)](LICENSE)

This is a paper-review repo of top-tier computer systems conferences.

* OSDI 2016
* SOSP 2015
* NSDI 2017
--------------------------------------------------------------------------------
/SOSP2005/speculator.md:
--------------------------------------------------------------------------------
# Speculative Execution in a Distributed File System
## Problem
Each asynchronous write (e.g. from client A) on the distributed file system needs two steps: write and commit.

That means two RTTs of latency, which we want to reduce.

If another client (client B) wants to open the modified file, it needs a getattr RPC (1 RTT). If the file was modified, client B must discard its cached copy and reload the newly modified data, which is time-consuming.

## How to achieve speculative execution
Client A asynchronously issues the write and commit RPCs and keeps them in flight, speculating that all modifications are up to date.

Client B asynchronously issues getattr, speculating that its cached copy of the file is up to date.
--------------------------------------------------------------------------------
/gpu.md:
--------------------------------------------------------------------------------

### Distributed GPU
Data parallelism: SIMD BSP (e.g. Spark/Hadoop)

Model parallelism: partition the model itself across workers.

### Poseidon (Eric Xing)

https://github.com/sailing-pmls/poseidon

Data transmission: partial data transmission (partial gradient Tx/Rx between client and server): one-layer Tx/Rx instead of transmitting all layers' data.

### Bosen (for network efficiency)

Network bounded.

* Hash table for partial output communication, which minimizes lock contention.

* BSP (bulk synchronous parallel) vs. TAP (totally asynchronous parallelization): for machine learning, we use BSC (bounded staleness consistency).


## GPU Net
A socket abstraction for the GPU to directly access the NIC.

Ring buffer: 1) for a queue with a fixed maximum size; 2) well-suited for FIFO, whereas a non-circular buffer suits LIFO.

## Adam (MSR)

* Model parallel and data parallel.

* Tries to partition data to fit in the L3 cache.

* Convolution layers: periodic checkpointing.

* Fully connected layers: send activation & error gradient vectors instead of the weight matrix.

## Explore bounded-staleness data in ML

* Reduce communication => use stale parameters and update the parameters after X rounds of iteration.


## Google TPU evaluation

https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html

## GeePS (EuroSys 16)

GPU/CPU data movement in the background (staging, pipelining), with partial data movement; a toy version of the overlap is sketched below.
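A minimal sketch of that background staging (toy stand-ins for the copy and the compute; GeePS itself manages real GPU buffers and parameter caches):

```python
import queue
import threading
import time

def copy_to_gpu(batch):
    time.sleep(0.01)          # stand-in for a host-to-device copy
    return batch

def compute(batch):
    time.sleep(0.02)          # stand-in for the forward/backward pass
    return sum(batch)

def train(batches, depth=2):
    staged = queue.Queue(maxsize=depth)  # bounded => bounded GPU memory

    def stager():                        # background staging thread
        for b in batches:
            staged.put(copy_to_gpu(b))   # overlaps with compute() below
        staged.put(None)                 # sentinel: no more batches

    threading.Thread(target=stager, daemon=True).start()
    while (batch := staged.get()) is not None:
        compute(batch)

train([[i, i + 1] for i in range(10)])
```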
Parameters: a shard of the parameters is maintained on each local machine; each machine also maintains a stale cache of the global parameters. (May use P2P communication.)


## Exploring iterative-ness
Leverages the repeated access pattern of shared data (i.e. parameters in ML).

Parameter server: no state-update conflicts; the weights can be updated in any order within a CLOCK.

## Parameter server (Mu Li)

*machine learning MapReduce parameter server*

## STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning

Improves the convergence speed of model-parallel ML at scale.

### Controlling parameter staleness
SSP (stale synchronous parallel): bookkeeping on the deviation between workers.

STRADS: controls pipeline depth.

STRADS: model parallel instead of data parallel.

Parameter server: couples the data and control planes, whereas STRADS decouples them, using either a ring structure or pipelined data processing over communication.


## Distributed GraphLab (VLDB 2012)

Serializability:

Full consistency: the update function has complete read-write access to its entire scope.

Edge consistency: the update function has read access to adjacent vertices.

Vertex consistency: write access to the central vertex's data.

### Coloring step to verify edge consistency

Coloring step (NP-hard, thus using a greedy method):

1. Edge consistency => assign a color to each vertex such that no adjacent vertices share the same color.

2. Full consistency => no vertex shares a color with any of its distance-two neighbors.

3. Vertex consistency => all vertices in one color.

After the coloring step, we can synchronously execute all vertices of the same color, as in the sketch below.
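A minimal greedy-coloring sketch (my illustration; GraphLab's engine implementation differs):

```python
def greedy_color(adj):
    """adj: vertex -> set of neighbors. Optimal coloring is NP-hard, so
    greedily give each vertex the smallest color its neighbors don't use."""
    color = {}
    for v in adj:                       # any visit order works; heuristics help
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj) + 1) if c not in used)
    return color

# Vertices of the same color share no edge, so under edge consistency
# each color class can be updated in parallel, one color at a time.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(greedy_color(adj))  # {'a': 0, 'b': 1, 'c': 2, 'd': 1}
```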
## My thoughts

### Data staleness vs. networking latency
Reduce to a parameter server. Use a hash function to guarantee M concurrent data transmissions between clients and the parameter server; the other clients use their local stale parameters.

### Use rateless codes (e.g. spinal coding) to achieve load balancing or congestion control.

## GPU shortcomings

1. Small GPU memory.

2. Expensive memory copies between GPU memory and CPU memory.

3. Data parallelism has benefits, but network communication overhead quickly limits scalability.


## Why do we need a parameter server?

https://www.quora.com/What-is-the-Parameter-Server

In a nutshell, it is best thought of as a shared blackboard.



## My thoughts, 04/12/2017

SplitStream tree depth == stale-data clock bound.

Why not P2P over workers: small-message transmission is evil, and it is very difficult to keep data consistent.

Erasure codes: (m+k) data splits; pick the fastest top m of them. Mitigates the staleness of processing.

Fine-scale parallelism.


## Rhythm

The key insight is that, given an incoming stream of requests, a server could delay some requests in order to align the execution of similar requests, allowing them to execute concurrently.


# How to use a parameter server
http://pmls.readthedocs.io/en/latest/installation.html


### Find more application scenarios.


## Hemingway

Mini-batch SGD (batch size b) introduces training error on the order of √b compared with sequential one-by-one data processing.

### Main idea

Window-based prediction of the loss function: given information from the first 50 iterations, predict iteration 51 (or iterations 51-60). When combined with Ernest (by Shivaram), it can map the iteration count to time.


# TPU introduction
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

## Key concepts
1. Systolic array: hold intermediate data inside the ALUs and pass it among ALUs without writing it back to registers or memory. The data-processing flow is more like Ray.

2. The TPU has a low clock frequency but highly parallel computation.

| Processor | Operations per cycle |
| --- | --- |
| CPU | a few |
| CPU (vector extension) | tens |
| GPU | tens of thousands |
| TPU | hundreds of thousands, up to 128K |



## Nvidia

1. GPUDirect RDMA

2. NVLink

3. MXNet

# Confidence is the key to success.
--------------------------------------------------------------------------------