├── .gitignore ├── Makefile ├── README.md ├── TODO.md ├── conferences ├── README.md ├── sigmod2016.md └── tapp.md ├── css ├── default_highlight.css └── style.css ├── footer.html ├── header.html ├── html ├── adya2000generalized.html ├── agarwal1996computation.html ├── agrawal1987concurrency.html ├── akidau2013millwheel.html ├── alvaro2010boom.html ├── alvaro2011consistency.html ├── alvaro2011dedalus.html ├── alvaro2015lineage.html ├── arasu2006cql.html ├── armbrust2015spark.html ├── avnur2000eddies.html ├── bailis2013highly.html ├── bailis2014coordination.html ├── balegas2015putting.html ├── barham2003xen.html ├── baumann2015shielding.html ├── beckmann1990r.html ├── berenson1995critique.html ├── bershad1995spin.html ├── brewer2005combining.html ├── brewer2012cap.html ├── brin1998anatomy.html ├── bugnion1997disco.html ├── burns2016borg.html ├── burrows2006chubby.html ├── cafarella2008webtables.html ├── carriero1994linda.html ├── chamberlin1981history.html ├── chang2008bigtable.html ├── chen2016realtime.html ├── cheney2009provenance.html ├── clark2005live.html ├── codd1970relational.html ├── conway2012logic.html ├── conway2014edelweiss.html ├── corbett2013spanner.html ├── crooks2016tardis.html ├── dan2017using.html ├── dean2008mapreduce.html ├── decandia2007dynamo.html ├── dewitt1990gamma.html ├── dewitt1992parallel.html ├── diaconu2013hekaton.html ├── elphinstone2013l3.html ├── engler1995exokernel.html ├── ghemawat2003google.html ├── ghodsi2011dominant.html ├── gilbert2002brewer.html ├── goldman1997dataguides.html ├── golub1992microkernel.html ├── gonzalez2012powergraph.html ├── gotsman2016cause.html ├── graefe1990encapsulation.html ├── graefe1993volcano.html ├── graefe2009five.html ├── gray1976granularity.html ├── green2013datalog.html ├── halevy2016goods.html ├── harinarayan1996implementing.html ├── hellerstein2007architecture.html ├── hellerstein2010declarative.html ├── hellerstein2012madlib.html ├── herlihy1990linearizability.html ├── hindman2011mesos.html ├── holt2016disciplined.html ├── isard2007dryad.html ├── kandel2011wrangler.html ├── kapritsos2012all.html ├── klonatos2014building.html ├── kohler2000click.html ├── kohler2012declarative.html ├── kornacker2015impala.html ├── kulkarni2015twitter.html ├── kung1981optimistic.html ├── lagar2009snowflock.html ├── lampson1980experience.html ├── lauer1979duality.html ├── lehman1981efficient.html ├── lehman1999t.html ├── letia2009crdts.html ├── li2012making.html ├── li2014automating.html ├── li2014scaling.html ├── lin2016towards.html ├── liu1973scheduling.html ├── lloyd2011don.html ├── madden2002tag.html ├── maddox2016decibel.html ├── mckeen2013innovative.html ├── mckusick1984fast.html ├── miller2000schema.html ├── mohan1986transaction.html ├── moore2006inferring.html ├── murray2013naiad.html ├── o1986escrow.html ├── o1997improved.html ├── ongaro2014search.html ├── prabhakaran2005analysis.html ├── quigley2009ros.html ├── recht2011hogwild.html ├── ritchie1978unix.html ├── roy2015homeostasis.html ├── saltzer1984end.html ├── selinger1979access.html ├── shapiro2011conflict.html ├── sigelman2010dapper.html ├── stoica2001chord.html ├── stonebraker1987design.html ├── stonebraker1991postgres.html ├── stonebraker2005c.html ├── taft2014store.html ├── terry1995managing.html ├── terry2013replicated.html ├── thomson2012calvin.html ├── toshniwal2014storm.html ├── tu2013speedy.html ├── van2004chain.html ├── vavilapalli2013apache.html ├── verma2015large.html ├── vogels2009eventually.html ├── waldspurger1994lottery.html ├── welsh2001seda.html ├── 
wilkes1996hp.html ├── yu2008dryadlinq.html ├── zaharia2012resilient.html ├── zaharia2013discretized.html ├── zhou2010efficient.html ├── zhou2011tap.html └── zhou2012distributed.html ├── images ├── dan2017using_fork.png └── dan2017using_zigzag.png ├── index.html ├── js ├── highlight.pack.js └── mathjax_config.js └── papers ├── adya2000generalized.md ├── agarwal1996computation.md ├── agrawal1987concurrency.md ├── akidau2013millwheel.md ├── alvaro2010boom.md ├── alvaro2011consistency.md ├── alvaro2011dedalus.md ├── alvaro2015lineage.md ├── arasu2006cql.md ├── armbrust2015spark.md ├── avnur2000eddies.md ├── bailis2013highly.md ├── bailis2014coordination.md ├── balegas2015putting.md ├── barham2003xen.md ├── baumann2015shielding.md ├── beckmann1990r.md ├── berenson1995critique.md ├── bershad1995spin.md ├── brewer2005combining.md ├── brewer2012cap.md ├── brin1998anatomy.md ├── bugnion1997disco.md ├── burns2016borg.md ├── burrows2006chubby.md ├── cafarella2008webtables.md ├── carriero1994linda.md ├── chamberlin1981history.md ├── chang2008bigtable.md ├── chen2016realtime.md ├── cheney2009provenance.md ├── clark2005live.md ├── codd1970relational.md ├── conway2012logic.md ├── conway2014edelweiss.md ├── corbett2013spanner.md ├── crooks2016tardis.md ├── dan2017using.md ├── dean2008mapreduce.md ├── decandia2007dynamo.md ├── dewitt1990gamma.md ├── dewitt1992parallel.md ├── diaconu2013hekaton.md ├── elphinstone2013l3.md ├── engler1995exokernel.md ├── ghemawat2003google.md ├── ghodsi2011dominant.md ├── gilbert2002brewer.md ├── goldman1997dataguides.md ├── golub1992microkernel.md ├── gonzalez2012powergraph.md ├── gotsman2016cause.md ├── graefe1990encapsulation.md ├── graefe1993volcano.md ├── graefe2009five.md ├── gray1976granularity.md ├── green2013datalog.md ├── halevy2016goods.md ├── harinarayan1996implementing.md ├── hellerstein2007architecture.md ├── hellerstein2010declarative.md ├── hellerstein2012madlib.md ├── herlihy1990linearizability.md ├── hindman2011mesos.md ├── holt2016disciplined.md ├── isard2007dryad.md ├── kandel2011wrangler.md ├── kapritsos2012all.md ├── klonatos2014building.md ├── kohler2000click.md ├── kohler2012declarative.md ├── kornacker2015impala.md ├── kulkarni2015twitter.md ├── kung1981optimistic.md ├── lagar2009snowflock.md ├── lampson1980experience.md ├── lauer1979duality.md ├── lehman1981efficient.md ├── lehman1999t.md ├── letia2009crdts.md ├── li2012making.md ├── li2014automating.md ├── li2014scaling.md ├── lin2016towards.md ├── liu1973scheduling.md ├── lloyd2011don.md ├── madden2002tag.md ├── maddox2016decibel.md ├── mckeen2013innovative.md ├── mckusick1984fast.md ├── miller2000schema.md ├── mohan1986transaction.md ├── moore2006inferring.md ├── murray2013naiad.md ├── o1986escrow.md ├── o1997improved.md ├── ongaro2014search.md ├── prabhakaran2005analysis.md ├── quigley2009ros.md ├── recht2011hogwild.md ├── ritchie1978unix.md ├── roy2015homeostasis.md ├── saltzer1984end.md ├── selinger1979access.md ├── shapiro2011conflict.md ├── sigelman2010dapper.md ├── stoica2001chord.md ├── stonebraker1987design.md ├── stonebraker1991postgres.md ├── stonebraker2005c.md ├── taft2014store.md ├── terry1995managing.md ├── terry2013replicated.md ├── thomson2012calvin.md ├── toshniwal2014storm.md ├── tu2013speedy.md ├── van2004chain.md ├── vavilapalli2013apache.md ├── verma2015large.md ├── vogels2009eventually.md ├── waldspurger1994lottery.md ├── welsh2001seda.md ├── wilkes1996hp.md ├── yu2008dryadlinq.md ├── zaharia2012resilient.md ├── zaharia2013discretized.md ├── zhou2010efficient.md ├── 
zhou2011tap.md └── zhou2012distributed.md /.gitignore: -------------------------------------------------------------------------------- 1 | # swap 2 | [._]*.s[a-w][a-z] 3 | [._]s[a-w][a-z] 4 | # session 5 | Session.vim 6 | # temporary 7 | .netrwhist 8 | *~ 9 | # auto-generated tag files 10 | tags 11 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | PAPERS = $(wildcard papers/*.md) 2 | HTMLS = $(subst papers/,html/,$(patsubst %.md,%.html,$(PAPERS))) 3 | 4 | default: $(HTMLS) 5 | 6 | html/%.html: papers/%.md header.html footer.html 7 | cat header.html > $@ 8 | pandoc --from markdown-tex_math_dollars-raw_tex --to html --ascii $< >> $@ 9 | cat footer.html >> $@ 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # [Papers](https://mwhittaker.github.io/papers) 2 | This repository is a compendium of notes on papers I've read. 3 | 4 | ## Getting Started 5 | If you want to read the paper summaries, simply go 6 | [here](https://mwhittaker.github.io/papers). If you want to add a paper 7 | summary, or clone this repo and create summaries of your own, here's how 8 | everything works. 9 | 10 | - First, add your summary as a markdown file in [the `papers` 11 | directory](papers/). 12 | - Run `make` to convert each markdown file `foo.md` into an HTML file 13 | `foo.html` using `pandoc`. 14 | - Then, update [`index.html`](index.html) with a link to the HTML file (e.g. 15 | `foo.html`) 16 | 17 | It's as simple as that. Oh, and if you want to include MathJax in your summary, 18 | add the following to the bottom of the markdown file: 19 | 20 | ```html 21 | 24 | ``` 25 | 26 | If you want to add syntax highlighting, add this: 27 | 28 | ```html 29 | 30 | 31 | 32 | ``` 33 | -------------------------------------------------------------------------------- /TODO.md: -------------------------------------------------------------------------------- 1 | - [x] `codd1970relational.md` 2 | - [x] `liu1973scheduling.md` 3 | - [x] `ritchie1978unix.md` 4 | - [x] `gray1976granularity.md` 5 | - [x] `lauer1979duality.md` 6 | - [x] `lampson1980experience.md` 7 | - [x] `chamberlin1981history.md` 8 | - [x] `mckusick1984fast.md` 9 | - [x] `saltzer1984end.md` 10 | - [x] `agrawal1987concurrency.md` 11 | - [x] `stonebraker1987design.md` 12 | - [x] `herlihy1990linearizability.md` 13 | - [x] `stonebraker1991postgres.md` 14 | - [x] `golub1992microkernel.md` 15 | - [ ] `carriero1994linda.md` 16 | - [ ] `waldspurger1994lottery.md` 17 | - [ ] `berenson1995critique.md` 18 | - [ ] `engler1995exokernel.md` 19 | - [ ] `bershad1995spin.md` 20 | - [ ] `wilkes1996hp.md` 21 | - [ ] `bugnion1997disco.md` 22 | - [ ] `lehman1999t.md` 23 | - [ ] `adya2000generalized.md` 24 | - [ ] `kohler2000click.md` 25 | - [ ] `stoica2001chord.md` 26 | - [ ] `moore2006inferring.md` 27 | - [ ] `welsh2001seda.md` 28 | - [ ] `gilbert2002brewer.md` 29 | - [ ] `ghemawat2003google.md` 30 | - [ ] `barham2003xen.md` 31 | - [ ] `van2004chain.md` 32 | - [ ] `dean2008mapreduce.md` 33 | - [ ] `prabhakaran2005analysis.md` 34 | - [ ] `clark2005live.md` 35 | - [ ] `burrows2006chubby.md` 36 | - [ ] `hellerstein2007architecture.md` 37 | - [ ] `isard2007dryad.md` 38 | - [ ] `decandia2007dynamo.md` 39 | - [ ] `chang2008bigtable.md` 40 | - [ ] `yu2008dryadlinq.md` 41 | - [ ] `letia2009crdts.md` 42 | - [ ] `graefe2009five.md` 43 | - 
[ ] `lagar2009snowflock.md` 44 | - [ ] `alvaro2010boom.md` 45 | - [ ] `hellerstein2010declarative.md` 46 | - [ ] `shapiro2011conflict.md` 47 | - [ ] `alvaro2011consistency.md` 48 | - [ ] `alvaro2011dedalus.md` 49 | - [ ] `ghodsi2011dominant.md` 50 | - [ ] `hindman2011mesos.md` 51 | - [ ] `lloyd2011don.md` 52 | - [ ] `conway2012logic.md` 53 | - [ ] `li2012making.md` 54 | - [ ] `zaharia2012resilient.md` 55 | - [ ] `vavilapalli2013apache.md` 56 | - [ ] `zaharia2013discretized.md` 57 | - [ ] `elphinstone2013l3.md` 58 | - [ ] `mckeen2013innovative.md` 59 | - [ ] `akidau2013millwheel.md` 60 | - [ ] `murray2013naiad.md` 61 | - [ ] `terry2013replicated.md` 62 | - [ ] `li2014automating.md` 63 | - [ ] `bailis2014coordination.md` 64 | - [ ] `conway2014edelweiss.md` 65 | - [ ] `bailis2013highly.md` 66 | - [ ] `ongaro2014search.md` 67 | - [ ] `baumann2015shielding.md` 68 | - [ ] `toshniwal2014storm.md` 69 | - [ ] `roy2015homeostasis.md` 70 | - [ ] `kornacker2015impala.md` 71 | - [ ] `verma2015large.md` 72 | - [ ] `balegas2015putting.md` 73 | - [ ] `armbrust2015spark.md` 74 | - [ ] `kulkarni2015twitter.md` 75 | - [ ] `burns2016borg.md` 76 | - [ ] `gotsman2016cause.md` 77 | - [ ] `maddox2016decibel.md` 78 | - [ ] `holt2016disciplined.md` 79 | - [ ] `halevy2016goods.md` 80 | - [ ] `chen2016realtime.md` 81 | - [ ] `crooks2016tardis.md` 82 | -------------------------------------------------------------------------------- /conferences/README.md: -------------------------------------------------------------------------------- 1 | - [ ] CIDR 2 | - [ ] SIGMOD 3 | - [ ] SOCC 4 | - [ ] VLDB 5 | - [ ] EuroSys 6 | - [ ] NSDI 7 | - [ ] OSDI 8 | - [ ] USENIX ATC 9 | - [ ] TODS 10 | - [ ] VLDBJ 11 | -------------------------------------------------------------------------------- /conferences/sigmod2016.md: -------------------------------------------------------------------------------- 1 | ### Session 12: Distributed Data Processing 2 | 1. **Realtime Data Processing at Facebook** Facebook describes five design 3 | decisions faced when building a stream processing system, what decisions 4 | existing stream processing system made, and how their stream processing 5 | systems (Puma, Swift, and Stylus) differ. 6 | 2. **SparkR: Scaling R Programs with Spark** R is convenient but slow. This 7 | paper presents an R package that provides a frontend to Apache Spark to 8 | allow large scale data analysis from within R. 9 | 3. **VectorH: taking SQL-on-Hadoop to the next level** VectorH is a 10 | SQL-on-Hadoop system built on the Vectorwise database. VectorH builds on 11 | HDFS and YARN and supports ordered tables uing Positional Delta Trees. It is 12 | orders of magnitude faster than existing SQL-on-Hadoop systems HAWQ, Impala, 13 | SparkSQL, and Hive. 14 | 4. **Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed 15 | In-memory Databases** Some memory databases have eschewed ARIES-like data 16 | logging for command logging to reduce the size of logs. However, this 17 | increases recovery time. This paper improves command logging allowing nodes 18 | to recover in parallel and also introduces an adaptive logging scheme that 19 | uses both data and command logging. 20 | 5. **Big Data Analytics with Datalog Queries on Spark** This paper presents 21 | BigDatalog: a system which runs Datalog efficiently using Apache Spark by 22 | exploiting compilation and optimization techniques. 23 | 6. 
**An Efficient MapReduce Cube Algorithm for Varied Data Distributions** This 24 | paper presents an algorithm that computes Data cubes using MapReduce that 25 | can tolerate skewed data distributions using a Skews and Partitions Sketch 26 | data structure. 27 | 28 | ### Session 18: Transactions and Consistency 29 | 1. **TARDiS: A Branch-and-Merge Approach To Weak Consistency** 30 | TARDiS is a transactional key-value store explicitly designed with weak 31 | consistency in mind. TARDiS exposes the set of conflicting branches in an 32 | eventually consistent system and allows clients to merge when desired. 33 | 2. **TicToc: Time Traveling Optimistic Concurrency Control** TicToc is a new 34 | timestamp management protocol that assigns read and write timestamps to data 35 | items and lazily computes commit timestamps. 36 | 3. **Scaling Multicore Databases via Constrained Parallel Execution** 2PL and 37 | OCC sacrifice parallelism in high contention workloads. This paper presents 38 | a new concurrency control scheme: interleaving constrained concurrency 39 | control (IC3). IC3 uses static analysis and runtime techniques. 40 | 4. **Towards a Non-2PC Transaction Management in Distributed Database Systems** 41 | Shared nothing transactional databases have fast local transactions but slow 42 | distributed transactions due to 2PC. This paper introduces the LEAP 43 | transaction management scheme that converts distributed transactions into 44 | local ones. Their database L-store is compared against H-store. 45 | 5. **ERMIA: Fast memory-optimized database system for heterogeneous workloads** 46 | ERMIA is a database that optimizes heterogeneous read-mostly workloads using 47 | snapshot isolation that performs better than traditional OCC systems. 48 | 6. **Transaction Healing: Scaling Optimistic Concurrency Control on Multicores** 49 | Multicore OCC databases don't scale well with high contention workloads. 50 | This paper presents a new concurrency control mechanism known as transaction 51 | healing where dependencies are analyzed ahead of time and transactions are 52 | not full-blown aborted when validation fails. Transaction healing is 53 | implemented in TheDB. 
54 | -------------------------------------------------------------------------------- /css/default_highlight.css: -------------------------------------------------------------------------------- 1 | /* 2 | 3 | Original highlight.js style (c) Ivan Sagalaev 4 | 5 | */ 6 | 7 | .hljs { 8 | /* display: block; */ 9 | /* overflow-x: auto; */ 10 | /* padding: 0.5em; */ 11 | /* background: #F0F0F0; */ 12 | } 13 | 14 | 15 | /* Base color: saturation 0; */ 16 | 17 | .hljs, 18 | .hljs-subst { 19 | color: #444; 20 | } 21 | 22 | .hljs-comment { 23 | color: #888888; 24 | } 25 | 26 | .hljs-keyword, 27 | .hljs-attribute, 28 | .hljs-selector-tag, 29 | .hljs-meta-keyword, 30 | .hljs-doctag, 31 | .hljs-name { 32 | font-weight: bold; 33 | } 34 | 35 | 36 | /* User color: hue: 0 */ 37 | 38 | .hljs-type, 39 | .hljs-string, 40 | .hljs-number, 41 | .hljs-selector-id, 42 | .hljs-selector-class, 43 | .hljs-quote, 44 | .hljs-template-tag, 45 | .hljs-deletion { 46 | color: #880000; 47 | } 48 | 49 | .hljs-title, 50 | .hljs-section { 51 | color: #880000; 52 | font-weight: bold; 53 | } 54 | 55 | .hljs-regexp, 56 | .hljs-symbol, 57 | .hljs-variable, 58 | .hljs-template-variable, 59 | .hljs-link, 60 | .hljs-selector-attr, 61 | .hljs-selector-pseudo { 62 | color: #BC6060; 63 | } 64 | 65 | 66 | /* Language color: hue: 90; */ 67 | 68 | .hljs-literal { 69 | color: #78A960; 70 | } 71 | 72 | .hljs-built_in, 73 | .hljs-bullet, 74 | .hljs-code, 75 | .hljs-addition { 76 | color: #397300; 77 | } 78 | 79 | 80 | /* Meta color: hue: 200 */ 81 | 82 | .hljs-meta { 83 | color: #1f7199; 84 | } 85 | 86 | .hljs-meta-string { 87 | color: #4d99bf; 88 | } 89 | 90 | 91 | /* Misc effects */ 92 | 93 | .hljs-emphasis { 94 | font-style: italic; 95 | } 96 | 97 | .hljs-strong { 98 | font-weight: bold; 99 | } 100 | -------------------------------------------------------------------------------- /css/style.css: -------------------------------------------------------------------------------- 1 | html { 2 | font-family: "Helvetica Neue",Helvetica,Arial,sans-serif; 3 | background: #FEFEFE; 4 | } 5 | 6 | body { 7 | color: #333; 8 | margin: 0; 9 | padding: 0; 10 | font-size: 18px; 11 | } 12 | 13 | #header { 14 | background-color: #f44336; 15 | color: white; 16 | font-size: 24pt; 17 | font-weight: bold; 18 | padding-bottom: 10px; 19 | padding-top: 10px; 20 | text-align: center; 21 | } 22 | 23 | #header a { 24 | color: white; 25 | text-decoration: none; 26 | } 27 | 28 | #container { 29 | padding-left: 1em; 30 | padding-right: 1em; 31 | max-width: 44em; 32 | line-height: 27px; 33 | padding-top: 20px; 34 | padding-bottom: 20px; 35 | margin-left: auto; 36 | margin-right: auto; 37 | } 38 | 39 | @media screen and (min-width: 600px) { 40 | #container { 41 | padding-left: 28px; 42 | padding-right: 28px; 43 | } 44 | } 45 | 46 | #indextitle { 47 | font-size: 60pt; 48 | margin-top: 0pt; 49 | margin-bottom: 0pt; 50 | } 51 | 52 | .year { 53 | color: gray; 54 | } 55 | 56 | a { text-decoration: none; } 57 | h1 a:link, h1 a:visited { color: #222; } 58 | h1 a:active, h1 a:hover { color: #444; } 59 | #paperlist a:link, #paperlist a:visited { color: #222; } 60 | #paperlist a:active, #paperlist a:hover { color: #444; } 61 | 62 | h1 { 63 | line-height: 1; 64 | margin-top: 0em; 65 | margin-bottom: 0.55em; 66 | } 67 | 68 | h2 { 69 | margin-top: 0.9em; 70 | margin-bottom: 0.45em; 71 | } 72 | 73 | h3 { 74 | margin-top: 0.8em; 75 | margin-bottom: 0.35em; 76 | } 77 | 78 | h4 { 79 | margin-top: 0.7em; 80 | margin-bottom: 0.25em; 81 | } 82 | 83 | p { 84 | margin-top: 0em; 85 | 
margin-bottom: 1.15em; 86 | } 87 | 88 | pre { 89 | margin-top: 0em; 90 | margin-bottom: 1.15em; 91 | background-color: #f7f7f7; 92 | overflow: auto; 93 | padding: 5pt; 94 | } 95 | 96 | @media screen and (min-width: 600px) { 97 | pre { 98 | padding: 16pt; 99 | } 100 | } 101 | 102 | code { 103 | font-family: Consolas, "Liberation Mono", Menlo, Courier, monospace; 104 | } 105 | 106 | table { 107 | margin-bottom: 1.15em; 108 | border-collapse: collapse; 109 | } 110 | 111 | td, th { 112 | padding: 2pt; 113 | border: 1.25pt solid gray; 114 | } 115 | 116 | blockquote { 117 | padding-left: 1em; 118 | color: #777; 119 | border-left: 0.25em solid #ddd; 120 | } 121 | 122 | .math { 123 | overflow: auto; 124 | } 125 | -------------------------------------------------------------------------------- /footer.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 13 | 14 | 15 | -------------------------------------------------------------------------------- /header.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 | -------------------------------------------------------------------------------- /html/adya2000generalized.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Generalized Isolation Level Definitions (2000)

15 |

Summary. In addition to serializability, ANSI SQL-92 defined a set of weaker isolation levels that applications could use to improve performance at the cost of consistency. The definitions were implementation-independent but ambiguous. Berenson et al. proposed a revision of the isolation level definitions that was unambiguous but specific to locking. Specifically, they define a set of phenomena:

16 |
    17 |
  • P0: w1(x) ... w2(x) ... "dirty write"
  • 18 |
  • P1: w1(x) ... r2(x) ... "dirty read"
  • 19 |
  • P2: r1(x) ... w2(x) ... "unrepeatable read"
  • 20 |
  • P3: r1(P) ... w2(y in P) ... "phantom read"
  • 21 |
22 |

and define the isolation levels according to which phenomena they preclude. This preclusion can be implemented by varying how long certain types of locks are held:

| write locks | read locks | phantom locks | precluded      |
| ----------- | ---------- | ------------- | -------------- |
| short       | short      | short         | P0             |
| long        | short      | short         | P0, P1         |
| long        | long       | short         | P0, P1, P2     |
| long        | long       | long          | P0, P1, P2, P3 |
59 |

This locking-specific preventative approach to defining isolation levels, while unambiguous, rules out many non-locking implementations of concurrency control. Notably, it does not allow for multiversioning and does not allow non-committed transactions to experience weaker consistency than committed transactions. Moreover, many isolation levels are naturally expressed as invariants between multiple objects, but these definitions are all over a single object.

60 |

This paper introduces implementation-independent, unambiguous isolation level definitions. The definitions also include notions of predicates at all levels. It does so by first introducing the definition of a history as a partial order of read/write/commit/abort events and a total order of committed object versions. It then introduces three dependencies: read-dependencies, anti-dependencies, and write-dependencies (also known as write-read, read-write, and write-write dependencies). Next, it describes how to construct a dependency graph and defines isolation levels as constraints on these graphs.
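As a rough illustration (mine, not the paper's), here is a minimal Python sketch of the dependency-graph idea, assuming a history is given as a flat list of (transaction, operation, object) events in commit order: it collects write-write edges and checks whether they form a cycle, i.e. phenomenon G0, which PL-1 precludes.

```python
from collections import defaultdict

def write_dependency_edges(history):
    """history: list of (txn, op, obj) events in commit order, op in {'r', 'w'}.
    Edge (t1, t2) means t2 installed the version of obj immediately after t1."""
    last_writer = {}
    edges = set()
    for txn, op, obj in history:
        if op == 'w':
            if obj in last_writer and last_writer[obj] != txn:
                edges.add((last_writer[obj], txn))
            last_writer[obj] = txn
    return edges

def has_cycle(edges):
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
    visiting, done = set(), set()
    def dfs(node):
        visiting.add(node)
        for nxt in graph[node]:
            if nxt in visiting:
                return True
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False
    return any(dfs(n) for n in list(graph) if n not in done)

# w1(x) w2(x) w2(y) w1(y): T1 and T2 overwrite each other -> G0, so not PL-1.
history = [('T1', 'w', 'x'), ('T2', 'w', 'x'), ('T2', 'w', 'y'), ('T1', 'w', 'y')]
print(has_cycle(write_dependency_edges(history)))  # True
```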

61 |

For example, the G0 phenomenon says that a dependency graph contains a write-dependency cycle. PL-1 is the isolation level that precludes G0. Similarly, the G1 phenomenon says that either

62 |
    63 |
  1. a committed transaction reads an aborted value,
  2. 64 |
  3. a committed transaction reads an intermediate value, or
  4. 65 |
  5. there is a write-read/write-write cycle.
  6. 66 |
67 |

The PL-2 isolation level precludes G1 (and therefore G0) and corresponds roughly to the READ-COMMITTED isolation level.

68 |
69 | 70 | 71 | 80 | 81 | 82 | -------------------------------------------------------------------------------- /html/alvaro2010boom.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud (2010)

15 |

Summary. Programming distributed systems is hard. Really hard. This paper conjectures that data-centric programming done with declarative programming languages can lead to simpler distributed systems that are more correct with less code. To support this conjecture, the authors implement an HDFS and Hadoop clone in Overlog, dubbed BOOM-FS and BOOM-MR respectively, using orders of magnitude fewer lines of code than the original implementations. They also extend BOOM-FS with increased availability, scalability, and monitoring.

16 |

An HDFS cluster consists of a NameNode responsible for metadata management and DataNodes responsible for data management. BOOM-FS reimplements the metadata protocol in Overlog; the data protocol is implemented in Java. The implementation models the entire system state (e.g. files, file paths, heartbeats) as data in a unified way by storing it as collections. The Overlog implementation of the NameNode then operates on and updates these collections. Some of the data (e.g. file paths) are actually views that can be optionally materialized and incrementally maintained. After reaching (almost) feature parity with HDFS, the authors increased the availability of the NameNode by using Paxos to introduce hot standby replicas. They also partition the NameNode metadata to increase scalability and use metaprogramming to implement monitoring.

17 |

BOOM-MR plugs into the existing Hadoop code but reimplements two MapReduce scheduling algorithms: Hadoop's first-come first-served algorithm and Zaharia's LATE policy.

18 |

BOOM Analytics was implemented in orders of magnitude fewer lines of code thanks to the data-centric approach and the declarative programming language. The implementation is also almost as fast as the systems it reimplements.

19 |
20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /html/alvaro2011consistency.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Consistency Analysis in Bloom: a CALM and Collected Approach (2011)

15 |

Summary. Strong consistency eases reasoning about distributed systems, but it requires coordination which entails higher latency and unavailability. Adopting weaker consistency models can improve system performance but writing applications against these weaker guarantees can be nuanced, and programmers don't have any reasoning tools at their disposal. This paper introduces the CALM conjecture and Bloom, a disorderly declarative programming language based on CALM, which allows users to write loosely consistent systems in a more principled way.

16 |

Say we've eschewed strong consistency: how do we know our program is even eventually consistent? For example, consider a distributed register in which servers accept writes on a first-come, first-served basis. Two clients could concurrently write two distinct values x and y. One server may receive x then y; the other y then x. This system is not eventually consistent. Even after client requests have quiesced, the distributed register is in an inconsistent state. The CALM conjecture embraces Consistency As Logical Monotonicity and says that logically monotonic programs are eventually consistent for any ordering and interleaving of message delivery and computation. Moreover, they do not require waiting or coordination to stream results to clients.
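A toy illustration of the register example (my own sketch, not Bloom code): a first-writer-wins register's final state depends on delivery order, while a logically monotonic grow-only set converges to the same state under any order.

```python
# Toy models (mine, not Bloom code) of order sensitivity.
class FirstWriteRegister:
    """Keeps the first write it receives: the outcome depends on delivery order."""
    def __init__(self):
        self.value = None
    def write(self, v):
        if self.value is None:
            self.value = v

class GrowOnlySet:
    """Logically monotonic: the final state is the same for any delivery order."""
    def __init__(self):
        self.items = set()
    def write(self, v):
        self.items.add(v)

for model in (FirstWriteRegister, GrowOnlySet):
    server1, server2 = model(), model()
    for v in ('x', 'y'):
        server1.write(v)   # server 1 receives x then y
    for v in ('y', 'x'):
        server2.write(v)   # server 2 receives y then x
    print(model.__name__, vars(server1) == vars(server2))
# FirstWriteRegister False, GrowOnlySet True: only the monotonic program converges.
```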

17 |

Bud (Bloom under development) is the realization of the CALM conjecture. It is Bloom implemented as a Ruby DSL. Users declare a set of Bloom collections (e.g. persistent tables, temporary tables, channels) and an order-independent set of declarative statements similar to Datalog. Viewing Bloom through the lens of its procedural semantics, Bloom execution proceeds in timesteps, and each timestep is divided into three phases. First, external messages are delivered to a node. Second, the Bloom statements are evaluated. Third, messages are sent off elsewhere. Bloom also supports modules and interfaces to improve modularity.

18 |

The paper also implements a key-value store and shopping cart using Bloom and uses various visualization tools to guide the design of coordination-free implementations.

19 |
20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /html/armbrust2015spark.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Spark SQL: Relational Data Processing in Spark (2015)

15 |

Summary. Data processing frameworks like MapReduce and Spark can do things that relational databases can't do very easily. For example, they can operate over semi-structured or unstructured data, and they can perform advanced analytics. On the other hand, Spark's API allows user to run arbitrary code (e.g. rdd.map(some_arbitrary_function)) which prevents Spark from performing certain optimizations. Spark SQL marries imperative Spark-like data processing with declarative SQL-like data processing into a single unified interface.

16 |

Spark's main abstraction was an RDD. Spark SQL's main abstraction is a DataFrame: the Spark analog of a table which supports a nested data model of standard SQL types as well as structs, arrays, maps, unions, and user defined types. DataFrames can be manipulated as if they were RDDs of row objects (e.g. dataframe.map(row_func)), but they also support a set of standard relational operators which take ASTs, built using a DSL, as arguments. For example, users.where(users("age") < 40) builds an AST from the expression users("age") < 40 and passes it as the argument that filters the users DataFrame. By passing in ASTs as arguments rather than arbitrary user code, Spark is able to perform optimizations it previously could not do. DataFrames can also be queried using SQL.
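As a concrete example in PySpark (a sketch assuming a local SparkSession and a toy users table), the predicate below is captured as a Column expression tree that the optimizer can inspect, in contrast to an opaque Python lambda applied to the underlying RDD:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

# Toy data standing in for a real users table.
users = spark.createDataFrame([("alice", 34), ("bob", 52)], ["name", "age"])

# users["age"] < 40 builds a Column expression (an AST), not a Python closure,
# so the optimizer can inspect it and, e.g., push the filter down.
young = users.where(users["age"] < 40).select("name")
young.show()

# Contrast: an RDD-style filter with an arbitrary Python function is opaque
# to the optimizer.
young_opaque = users.rdd.filter(lambda row: row.age < 40)
print(young_opaque.count())
```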

17 |

Notably, integrating queries into an existing programming language (e.g. Scala) makes writing queries much easier. Intermediate subqueries can be reused, queries can be constructed using standard control flow, etc. Moreover, Spark eagerly typechecks queries even though their execution is lazy. Furthermore, Spark SQL allows users to create DataFrames of language objects (e.g. Scala objects), and UDFs are just normal Scala functions.

18 |

DataFrame queries are optimized and manipulated by a new extensible query optimizer called Catalyst. The query optimizer manipulates ASTs written in Scala using rules, which are just functions from trees to trees that typically use pattern matching. Queries are optimized in four phases:

19 |
    20 |
  1. Analysis. First, relations and columns are resolved, queries are typechecked, etc.
  2. 21 |
  3. Logical optimization. Typical logical optimizations like constant folding, filter pushdown, boolean expression simplification, etc are performed.
  4. 22 |
  5. Physical planning. Cost based optimization is performed.
  6. 23 |
  7. Code generation. Scala quasiquoting is used for code generation.
  8. 24 |
25 |

Catalyst also makes it easy for people to add new data sources and user defined types.

26 |

Spark SQL also supports schema inference, ML integration, and query federation: useful features for big data.

27 |
28 | 29 | 30 | 39 | 40 | 41 | -------------------------------------------------------------------------------- /html/bailis2013highly.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Highly Available Transactions: Virtues and Limitations (2014)

15 |

Summary. Serializability is the gold standard of consistency, but databases have always provided weaker consistency modes (e.g. Read Committed, Repeatable Read) that promise improved performance. In this paper, Bailis et al. determine which of these weaker consistency models can be implemented with high availability.

16 |

First, why is high availability important?

17 |
    18 |
  1. Partitions. Partitions happen, and when they do non-available systems become, well, unavailable.
  2. 19 |
  3. Latency. Partitions may be transient, but latency is forever. Highly available systems can avoid latency by eschewing coordination costs.
  4. 20 |
21 |

Second, are weaker consistency models consistent enough? In short, yeah probably. In a survey of databases, Bailis finds that many do not employ serializability by default and some do not even provide full serializability. Bailis also finds that four of the five transactions in the TPC-C benchmark can be implemented with highly available transactions.

22 |

After defining availability, Bailis presents a taxonomy of which consistency models can be implemented as HATs, and also argues why some fundamentally cannot. He also performs benchmarks on AWS to show the performance benefits of HATs.

23 |
24 | 25 | 26 | 35 | 36 | 37 | -------------------------------------------------------------------------------- /html/bailis2014coordination.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Coordination Avoidance in Database Systems (2014)

15 |

Overview. Coordination in a distributed system is sometimes necessary to maintain application correctness, or consistency. For example, a payroll application may require that each employee has a unique ID, or that a jobs relation only include valid employees. However, coordination is not cheap. It increases latency, and in the face of partitions can lead to unavailability. Thus, when application correctness permits, coordination should be avoided. This paper develops the necessary and sufficient conditions for when coordination is needed to maintain a set of database invariants, using a notion of invariant confluence or I-confluence.

16 |

System Model. A database state is a set D of object versions drawn from the universe of possible states. Transactions operate on logical replicas that contain the set of object versions relevant to the transaction. Transactions are modeled as functions T : D -> D. The effects of a transaction are merged into an existing replica using an associative, commutative, idempotent merge operator, and replicas share changes with one another and merge them in the same way. In this paper, merge is set union, and we assume we know all transactions in advance. Invariants are modeled as boolean functions I: D -> 2. A state R is said to be I-valid if I(R) is true.

17 |

We say a system has transactional availability if whenever a transaction T can contact servers with the appropriate data in T, it only aborts if T chooses to abort. We say a system is convergent if after updates quiesce, all servers eventually have the same state. A system is globally I-valid if all replicas always have I-valid states. A system provides coordination-free execution if execution of a given transaction does not depend on the execution of others.

18 |

Consistency Sans Coordination. A state Si is I-T-Reachable if it is derivable from I-valid states using transactions in T. A set of transactions T is I-confluent with respect to invariant I if, for all I-T-Reachable states Di, Dj with a common ancestor, the merge of Di and Dj is I-valid. A globally I-valid system can execute a set of transactions T with global validity, transactional availability, convergence, and coordination-freedom if and only if T is I-confluent with respect to I.

19 |

Applying Invariant-Confluence. I-confluence can be applied to existing relational operators and constraints. For example, updates, inserts, and deletes are I-confluent with respect to per-record inequality constraints. Deletions are I-confluent with respect to foreign key constraints; additions and updates are not. I-confluence can also be applied to abstract data types like counters.
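A toy sketch (mine, not the paper's formalism) of what an I-confluence check looks like for one pair of divergent states, taking merge to be set union: a uniqueness invariant fails the check, so maintaining it requires coordination.

```python
def merge_preserves_invariant(invariant, d_i, d_j):
    """I-confluence demands that merging (here, set union) two I-valid divergent
    states yields an I-valid state; this checks a single candidate pair."""
    assert invariant(d_i) and invariant(d_j)
    return invariant(d_i | d_j)

# Invariant: employee IDs are unique.
unique_ids = lambda db: len({emp_id for (emp_id, name) in db}) == len(db)

base = frozenset({(1, 'ada')})
replica_a = base | {(2, 'bob')}    # replica A assigns ID 2 to bob
replica_b = base | {(2, 'carol')}  # replica B concurrently assigns ID 2 to carol

# Both replicas are I-valid on their own, but the merged state is not:
print(merge_preserves_invariant(unique_ids, replica_a, replica_b))  # False
# So inserts under a uniqueness invariant are not I-confluent and need coordination.
```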

20 |
21 | 22 | 23 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /html/baumann2015shielding.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Shielding Applications from an Untrusted Cloud with Haven (2014)

15 |

Summary. When running an application in the cloud, users have to trust (i) the cloud provider's software, (ii) the cloud provider's staff, and (iii) law enforcement with the ability to access user data. Intel SGX partially solves this problem by allowing users to run small portions of a program on remote servers with guarantees of confidentiality and integrity. Haven leverages SGX and Drawbridge to run entire legacy programs with shielded execution.

16 |

Haven assumes a very strong adversary which has access to all of the system's software and most of the system's hardware. Only the processor and SGX hardware are trusted. Haven provides confidentiality and integrity, but not availability. It also does not prevent side-channel attacks.

17 |

There are two main challenges that Haven's design addresses. First, most programs are written assuming a benevolent host. This leads to Iago attacks in which the OS subverts the application by exploiting its assumptions about the OS. Haven must operate correctly despite a malicious host. To do so, Haven uses a library operating system (LibOS) that is part of a Windows sandboxing framework called Drawbridge. LibOS implements a full OS API using only a few core host OS primitives. These core host OS primitives are used in a defensive way. A shield module sits below LibOS and takes great care to ensure that LibOS is not susceptible to Iago attacks. The user's application, LibOS, and the shield module are all run in an SGX enclave.

18 |

Second, Haven aims to run unmodified binaries which were not written with knowledge of SGX. Real world applications allocate memory, load and run code dynamically, etc. Many of these things are not supported by SGX, so Haven (a) emulated them and (b) got the SGX specification revised to address them.

19 |

Haven also implements an in-enclave encrypted file system in which only the root and leaf pages need to be written to stable storage. As of publication, however, Haven did not fully implement this feature. Haven is susceptible to replay attacks.

20 |

Haven was evaluated by running Microsoft SQL Server and Apache HTTP Server.

21 |
22 | 23 | 24 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /html/bershad1995spin.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

SPIN -- An Extensible Microkernel for Application-specific Operating System Services (1995)

15 |

Summary. Many operating systems were built a long time ago, and their performance was tailored to the applications and workloads of the time. More recent applications, like databases and multimedia applications, are quite different from those applications and can perform quite poorly on existing operating systems. SPIN is an extensible microkernel that allows applications to tailor the operating system to meet their needs.

16 |

Existing operating systems fit into one of three categories:

17 |
    18 |
  1. They have no interface by which applications can modify kernel behavior.
  2. 19 |
  3. They have a clean interface applications can use to modify kernel behavior but the implementation of the interface is inefficient.
  4. 20 |
  5. They have an unconstrained interface that is efficiently implemented but does not provide isolation between applications.
  6. 21 |
22 |

SPIN provides applications with a way to efficiently and safely modify the behavior of the kernel. Programs in SPIN are divided into user-level code and a spindle: a portion of user code that is dynamically installed and run in the kernel. The kernel provides a set of abstractions for physical and logical resources, and the spindles are responsible for managing these resources. The spindles can also register to be invoked when certain kernel events (e.g. page faults) occur. Installing spindles directly into the kernel provides efficiency. Applications can execute code in the kernel without the need for a context switch.

23 |

To ensure safety, spindles are written in a typed object-oriented language. Each spindle is like an object; it contains local state and a set of methods. Some of these methods can be called by the application, and some are registered as callbacks in the kernel. A spindle checker uses a combination of static analysis and runtime checks to ensure that the spindles meet certain kernel invariants. Moreover, SPIN relies on advanced compiler technology to ensure efficient spindle compilation.
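Purely as an illustration of the shape of this interface (SPIN's spindles were written in a typed, compiled language, not Python, and the names below are hypothetical), the sketch shows a spindle bundling local state with an event handler that a kernel installs only after a checker accepts it:

```python
class CountingFaultSpindle:
    """Hypothetical spindle: local state plus a handler the kernel can invoke."""
    def __init__(self):
        self.faults = 0
    def on_page_fault(self, addr):
        self.faults += 1
        return "map_page"          # tell the kernel how to resolve the fault

class ToyKernel:
    def __init__(self):
        self.handlers = {}
    def install(self, spindle, event, checker):
        # The checker stands in for SPIN's spindle checker (static analysis
        # plus runtime checks); here it is just a predicate.
        if not checker(spindle):
            raise ValueError("spindle rejected: violates kernel invariants")
        self.handlers[event] = getattr(spindle, "on_" + event)
    def raise_event(self, event, *args):
        return self.handlers[event](*args)

kernel = ToyKernel()
kernel.install(CountingFaultSpindle(), "page_fault",
               checker=lambda s: callable(getattr(s, "on_page_fault", None)))
print(kernel.raise_event("page_fault", 0x1000))  # "map_page", no context switch modeled
```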

24 |

General purpose high-performance computing, parallel processing, multimedia applications, databases, and information retrieval systems can benefit from the application-specific services provided by SPIN. Using techniques such as

25 |
    26 |
  • extensible IPC;
  • 27 |
  • application-level protocol processing;
  • 28 |
  • fast, simple, communication;
  • 29 |
  • application-specific file systems and buffer cache management;
  • 30 |
  • user-level scheduling;
  • 31 |
  • optimistic transaction;
  • 32 |
  • real-time scheduling policies;
  • 33 |
  • application-specific virtual memory; and
  • 34 |
  • runtime systems with memory system feedback,
  • 35 |
36 |

applications can be implemented more efficiently on SPIN than on traditional operating systems.

37 |
38 | 39 | 40 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /html/bugnion1997disco.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Disco: Running Commodity Operating Systems on Scalable Multiprocessors (1997)

15 |

Summary. Operating systems are complex, million line code bases. Multiprocessors were becoming popular, but it was too difficult to modify existing commercial operating systems to take full advantage of the new hardware. Disco is a virtual machine monitor, or hypervisor, that uses virtualization to run commercial virtual machines on cache-coherent NUMA multiprocessors. Guest operating systems running on Disco are only slightly modified, yet are still able to take advantage of the multiprocessor. Moreover, Disco offers all the traditional benefits of a hypervisor (e.g. fault isolation).

16 |

Disco provides the following interfaces:

17 |
    18 |
  • Processors. Disco provides full virtualization of the CPU allowing for restricted direct execution. Some privileged registers are mapped to memory to allow guest operating systems to read them.
  • 19 |
  • Physical memory. Disco virtualizes the guest operating system's physical address spaces, mapping them to hardware addresses. It also supports page migration and replication to alleviate the non-uniformity of a NUMA machine.
  • 20 |
  • I/O devices. All I/O communication is simulated, and various virtual disks are used.
  • 21 |
22 |

Disco is implemented as follows:

23 |
    24 |
  • Virtual CPUs. Disco maintains the equivalent of a process table entry for each guest operating system. Disco runs in kernel mode, guest operating systems run in supervisor mode, and applications run in user mode.
  • 25 |
  • Virtual physical memory. To avoid the overhead of physical to hardware address translation, Disco maintains a large software physical to hardware TLB.
  • 26 |
  • NUMA. Disco migrates pages to the CPUs that access them frequently, and replicates read-only pages to the CPUs that read them frequently. This dynamic page migration and replication helps mask the non-uniformity of a NUMA machine.
  • 27 |
  • Copy-on-write disks. Disco can map physical addresses in different guest operating systems to a read-only page in hardware memory. This lowers the memory footprint of running multiple guest operating systems. The shared pages are copy-on-write.
  • 28 |
  • Virtual network interfaces. Disco runs a virtual subnet over which guests can communicate using standard communication protocols like NFS and TCP. Disco uses a similar copy-on-write trick as above to avoid copying data between guests.
  • 29 |
30 |
31 | 32 | 33 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /html/burns2016borg.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Borg, Omega, and Kubernetes (2016)

15 |

Summary. Google has spent the last decade developing three container management systems. Borg is Google's main cluster management system that manages long-running production services and non-production batch jobs on the same set of machines to maximize cluster utilization. Omega is a clean-slate rewrite of Borg with a more principled architecture. In Omega, all system state lives in a consistent Paxos-based storage system that is accessed by a multitude of components which act as peers. Kubernetes is the latest open-source container manager that draws on lessons from both previous systems.

16 |

All three systems use containers for security and performance isolation. Container technology has evolved greatly since the inception of Borg, from chroot to jails to cgroups. Of course, containers cannot prevent all forms of performance interference. Today, containers also package program images.

17 |

Containers allow the cloud to shift from a machine-oriented design to an application-oriented design, a shift that brings a number of advantages.

18 |
    19 |
  • The gap between where an application is developed and where it is deployed is shrunk.
  • 20 |
  • Application writers don't have to worry about the details of the operating system on which the application will run.
  • 21 |
  • Infrastructure operators can upgrade hardware without worrying about breaking a lot of applications.
  • 22 |
  • Telemetry is tied to applications rather than machines which improves introspection and debugging.
  • 23 |
24 |

Container management systems typically also provide a host of other niceties including:

25 |
    26 |
  • naming services,
  • 27 |
  • autoscaling,
  • 28 |
  • load balancing,
  • 29 |
  • rollout tools, and
  • 30 |
  • monitoring tools.
  • 31 |
32 |

In Borg, these features were integrated over time in ad hoc ways. Kubernetes organizes these features under a unified and flexible API.

33 |

Google's experience has yielded a number of things to avoid:

34 |
    35 |
  • Container management systems shouldn't manage ports. Kubernetes gives each job a unique IP address allowing it to use any port it wants.
  • 36 |
  • Containers should have labels, not just numbers. Borg gave each task an index within its job. Kubernetes allows jobs to be labeled with key-value pairs and be grouped based on these labels.
  • 37 |
  • In Borg, every task belongs to a single job. Kubernetes makes task management more flexible by allowing a task to belong to multiple groups.
  • 38 |
  • Omega exposed the raw state of the system to its components. Kubernetes avoids this by arbitrating access to state through an API.
  • 39 |
40 |

Despite the decade of experience, there are still open problems yet to be solved:

41 |
    42 |
  • Configuration. Configuration languages begin simple but slowly evolve into complicated and poorly designed Turing complete programming languages. It's ideal to have configuration files be simple data files and let real programming languages manipulate them.
  • 43 |
  • Dependency management. Programs have lots of dependencies but don't manually state them. This makes automated dependency management very tough.
  • 44 |
45 |
46 | 47 | 48 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /html/codd1970relational.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

A Relational Model of Data for Large Shared Data Banks (1970)

15 |

Summary

16 |

In this paper, Ed Codd introduces the relational data model. Codd begins by motivating the importance of data independence: the independence of the way data is queried and the way data is stored. He argues that existing database systems at the time lacked data independence; namely, the ordering of relations, the indexes on relations, and the way the data was accessed was all made explicit when the data was queried. This made it impossible for the database to evolve the way data was stored without breaking existing programs which queried the data. The relational model, on the other hand, allowed for a much greater degree of data independence. After Codd introduces the relational model, he provides an algorithm to convert a relation (which may contain other relations) into first normal form (i.e. relations cannot contain other relations). He then describes basic relational operators, data redundancy, and methods to check for database consistency.
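As a small illustrative sketch (my own, not Codd's notation), the idea behind normalization into first normal form is to eliminate relation-valued attributes by splitting out the nested relation and propagating the parent's key:

```python
# A relation whose tuples contain a nested relation (children), keyed by emp_id.
employees = [
    {"emp_id": 1, "name": "ada", "children": [{"child": "carol"}, {"child": "dan"}]},
    {"emp_id": 2, "name": "bob", "children": []},
]

# First normal form: no relation-valued attributes. Split out the nested
# relation and propagate the parent's key into it.
employees_1nf = [{"emp_id": e["emp_id"], "name": e["name"]} for e in employees]
children_1nf = [{"emp_id": e["emp_id"], **c} for e in employees for c in e["children"]]

print(employees_1nf)  # [{'emp_id': 1, 'name': 'ada'}, {'emp_id': 2, 'name': 'bob'}]
print(children_1nf)   # [{'emp_id': 1, 'child': 'carol'}, {'emp_id': 1, 'child': 'dan'}]
```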

17 |

Commentary

18 |
    19 |
  1. Codd's advocacy for data independence and a declarative query language have stood the test of time. I particularly enjoy one excerpt from the paper where Codd says, "The universality of the data sublanguage lies in its descriptive ability (not its computing ability)".
  2. 20 |
  3. Database systems at the time generally had two types of data: collections and links between those collections. The relational model represented both as relations. Today, this seems rather mundane, but I can imagine this being counterintuitive at the time. This is also yet another example of a unifying interface which is demonstrated in both the Unix and System R papers.
  4. 21 |
22 |
23 | 24 | 25 | 34 | 35 | 36 | -------------------------------------------------------------------------------- /html/conway2012logic.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Logic and Lattices for Distributed Programming (2012)

15 |

Summary. CRDTs provide eventual consistency without the need for coordination. However, they suffer a scope problem: simple CRDTs are easy to reason about and use, but more complicated CRDTs force programmers to ensure they satisfy semilattice properties. They also lack composability. Consider, for example, a Students set and a Teams set. (Alice, Bob) can be added to Teams while concurrently Bob is removed from Students. Each individual set may be a CRDT, but there is no mechanism to enforce consistency between the CRDTs.

16 |

Bloom and CALM, on the other hand, allow for mechanized program analysis to guarantee that a program can avoid coordination. However, Bloom suffers from a type problem: it only operates on sets, which precludes the use of other useful structures such as integers.

17 |

This paper merges CRDTs and Bloom together by introducing bounded join semilattices into Bloom to form a new language: Bloom^L. Bloom^L operates over semilattices, applying semilattice methods. These methods can be designated as non-monotonic, monotonic, or homomorphic (which implies monotonic). So long as the program avoids non-monotonic methods, it can be realized without coordination. Moreover, morphisms can be implemented more efficiently than non-homomorphic monotonic methods. Bloom^L ships with a family of useful built-in semilattices like booleans ordered by implication, integers ordered by less than and greater than, sets, and maps. Users can also define their own semilattices, and Bloom^L allows smooth interoperability between Bloom collections and Bloom^L lattices. Bloom^L's semi-naive implementation is comparable to Bloom's semi-naive implementation.
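A rough sketch of the lattice idea in Python (Bloom^L itself is a Ruby DSL, so this is only illustrative): a max-of-integers lattice whose merge is associative, commutative, and idempotent, plus a monotone threshold map into the boolean lattice.

```python
class LMax:
    """Bounded join semilattice of integers ordered by <=; merge is max."""
    def __init__(self, value):
        self.value = value
    def merge(self, other):
        return LMax(max(self.value, other.value))
    def at_least(self, k):
        """Monotone map into the boolean lattice (ordered by implication):
        once it becomes True, further merges can never make it False."""
        return self.value >= k

a, b = LMax(3), LMax(7)
assert a.merge(b).value == b.merge(a).value == 7     # commutative
assert a.merge(a).value == 3                         # idempotent
assert not a.at_least(5) and a.merge(b).at_least(5)  # monotone under merge
```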

18 |

The paper also presents two case studies. First, they implement a key-value store as a map from keys to values annotated with vector clocks: a design inspired by Dynamo. They also implement a purely monotonic shopping cart using custom lattices.

19 |
20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /html/halevy2016goods.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Goods: Organizing Google's Datasets (2016)

15 |

Summary. In fear of fettering development and innovation, companies often allow engineers free rein to generate and analyze datasets at will. This often leads to unorganized data lakes: a ragtag collection of datasets from a diverse set of sources. Google Dataset Search (Goods) is a system which uses unobtrusive post-hoc metadata extraction and inference to organize Google's unorganized datasets and present curated dataset information, such as metadata and provenance, to engineers.

16 |

Building a system like Goods at Google scale presents many challenges.

17 |
    18 |
  • Scale. There are 26 billion datasets. 26 billion (with a b)!
  • 19 |
  • Variety. Data comes from a diverse set of sources (e.g. BigTable, Spanner, logs).
  • 20 |
  • Churn. Roughly 5% of the datasets are deleted every day, and datasets are created roughly as quickly as they are deleted.
  • 21 |
  • Uncertainty. Some metadata inference is approximate and speculative.
  • 22 |
  • Ranking. To facilitate useful dataset search, datasets have to be ranked by importance: a difficult heuristic-driven process.
  • 23 |
  • Semantics. Extracting the semantic content of a dataset is useful but challenging. For example consider a file of protos that doesn't reference the type of proto being stored.
  • 24 |
25 |

The Goods catalog is a BigTable keyed by dataset name where each row contains metadata including

26 |
    27 |
  • basic metadata like timestamps, owners, and access permissions;
  • 28 |
  • provenance showing the lineage of each dataset;
  • 29 |
  • schema;
  • 30 |
  • data summaries extracted from source code; and
  • 31 |
  • user provided annotations.
  • 32 |
33 |

Moreover, similar datasets or multiple versions of the same logical dataset are grouped together to form clusters. Metadata for one element of a cluster can be used as metadata for other elements of the cluster, greatly reducing the amount of metadata that needs to be computed. Data is clustered by timestamp, data center, machine, version, and UID, all of which are extracted from dataset paths (e.g. /foo/bar/montana/August01/foo.txt).
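As a hedged illustration of path-based clustering (the path convention and regular expression here are hypothetical, not Goods's actual rules), abstracting away the varying components of a dataset path yields a cluster key shared by versions of the same logical dataset:

```python
import re

# Hypothetical path convention: a date-like component distinguishes versions of
# the same logical dataset; abstracting it away yields the cluster key.
def cluster_key(path):
    parts = path.strip("/").split("/")[:-1]          # drop the file name
    return "/".join("<date>" if re.fullmatch(r"[A-Z][a-z]+\d{2}", p) else p
                    for p in parts)

paths = ["/foo/bar/montana/August01/foo.txt", "/foo/bar/montana/August02/foo.txt"]
print({cluster_key(p) for p in paths})   # {'foo/bar/montana/<date>'}
```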

34 |

In addition to storing dataset metadata, each row also stores status metadata: information about the completion status of various jobs which operate on the catalog. The numerous concurrently executing batch jobs use status metadata as a weak form of synchronization and dependency resolution, potentially deferring the processing of a row until another job has processed it.

35 |

The fault tolerance of these jobs is provided by a mix of job retries, BigTable's idempotent update semantics, and a watchdog that terminates divergent programs.

36 |

Finally, a two-phase garbage collector tombstones rows that satisfy a garbage collection predicate and removes them one day later if they still match the predicate. Batch jobs do not process tombstoned rows.

37 |

The Goods frontend includes dataset profile pages, dataset search driven by a handful of heuristics to rank datasets by importance, and team dashboards.

38 |
39 | 40 | 41 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /html/hindman2011mesos.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 | 17 | 18 | 19 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /html/kornacker2015impala.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Impala: A Modern, Open-Source SQL Engine for Hadoop (2015)

15 |

Summary. Impala is a distributed query engine built on top of Hadoop. That is, it builds off of existing Hadoop tools and frameworks and reads data stored in Hadoop file formats from HDFS.

16 |

Impala's CREATE TABLE commands specify the location and file format of data stored in Hadoop. This data can also be partitioned into different HDFS directories based on certain column values. Users can then issue typical SQL queries against the data. Impala supports batch INSERTs but doesn't support UPDATE or DELETE. Data can also be manipulated directly by going through HDFS.


Impala is divided into three components.

  1. An Impala daemon (impalad) runs on each machine and is responsible for receiving queries from users and for orchestrating the execution of queries.
  2. A single Statestore daemon (statestored) is a pub/sub system used to disseminate system metadata asynchronously to clients. The statestore has weak semantics and doesn't persist anything to disk.
  3. A single Catalog daemon (catalogd) publishes catalog information through the statestored. The catalogd pulls in metadata from external systems, puts it in Impala form, and pushes it through the statestored.

Impala has a Java frontend that performs the typical database frontend operations (e.g. parsing, semantic analysis, and query optimization). It uses a two-phase query planner.

  1. Single-node planning. First, a single-node, non-executable query plan tree is formed. Typical optimizations like join reordering are performed.
  2. Plan parallelization. After the single-node plan is formed, it is fragmented and divided between multiple nodes with the goal of minimizing data movement and maximizing scan locality.

Impala has a C++ backend that uses Volcano-style iterators with exchange operators and runtime code generation using LLVM. To efficiently read data from disk, Impala bypasses the traditional HDFS protocols. The backend supports a number of different file formats including Avro, RCFile, SequenceFile, plain text, and Parquet.


For cluster and resource management, Impala uses a home-grown system called Llama that sits over YARN.

-------------------------------------------------------------------------------- /html/lehman1999t.html: --------------------------------------------------------------------------------

T Spaces: The Next Wave (1999)


Summary. T Spaces is a


tuplespace-based network communication buffer with database capabilities that enables communication between applications and devices in a network of heterogeneous computers and operating systems


Essentially, it's Linda++; it implements a Linda tuplespace with a couple new operators and transactions.
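For a feel of the core operators, here is a toy in-memory tuplespace with Write, Read, Take, and Scan; blocking variants, Rhonda, indexing, and transactions are omitted, and the API is invented for illustration.

```python
class TupleSpace:
    """A toy tuplespace; None fields in a template act as wildcards."""

    def __init__(self):
        self.tuples = []

    def write(self, *tup):
        self.tuples.append(tup)

    def _matches(self, template, tup):
        return len(template) == len(tup) and \
            all(t is None or t == v for t, v in zip(template, tup))

    def read(self, *template):   # non-destructive match
        return next((t for t in self.tuples if self._matches(template, t)), None)

    def take(self, *template):   # destructive match
        t = self.read(*template)
        if t is not None:
            self.tuples.remove(t)
        return t

    def scan(self, *template):   # all matches
        return [t for t in self.tuples if self._matches(template, t)]

ts = TupleSpace()
ts.write("temperature", "kitchen", 21)
ts.write("temperature", "garage", 15)
print(ts.scan("temperature", None, None))   # both tuples
```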


The paper begins with a history of related tuplespace-based work. The notion of a shared collaborative space originated with the AI blackboard systems popular in the 1970s, the most famous of which was the Hearsay-II system. Later, the Stony Brook microcomputer Network (SBN), a cluster organized in a torus topology, was developed at Stony Brook, and Linda was invented to program it. Over time, the domain in which tuplespaces were popular shifted from parallel programming to distributed programming, and a huge number of Linda-like systems were implemented.


T Spaces is the marriage of tuplespaces, databases, and Java.

  • Tuplespaces provide a flexible communication model;
  • databases provide stability, durability, and advanced querying; and
  • Java provides portability and flexibility.

T Spaces implements a Linda tuplespace with a few improvements:

  • In addition to the traditional Read, Write, Take, WaitToRead, and WaitToTake operators, T Spaces also introduces a Scan/ConsumingScan operator to read/take all tuples matched by a query and a Rhonda operator to exchange tuples between processes.
  • Users can also dynamically register new operators, the implementation of which takes advantage of Java.
  • Fields of tuples are indexed by name, and tuples can be queried by named value. For example, the query (foo = 8) returns all tuples (of any type) with a field foo equal to 8. These indexes are similar to the inversions implemented in Phase 0 of System R.
  • Range queries are supported.
  • To avoid storing large values inside of tuples, file URLs can instead be stored, and T Spaces transparently handles locating and transferring the file contents.
  • T Spaces implements a group-based ACL form of authorization.
  • T Spaces supports transactions.

To evaluate the expressiveness and performance of T Spaces, the authors implement a collaborative web-crawling application, a web-search information delivery system, and a universal information appliance.

-------------------------------------------------------------------------------- /html/letia2009crdts.html: --------------------------------------------------------------------------------

CRDTs: Consistency without concurrency control (2009)


Overview. Concurrently updating distributed mutable data is a challenging problem that often requires expensive coordination. Commutative replicated data types (CRDTs) are data types with commutative update operators that can provide convergence without coordination. Moreover, non-trivial CRDTs exist; this paper presents Treedoc: an ordered set CRDT.


Ordered Set CRDT. An ordered set CRDT represents an ordered sequence of atoms. Atoms are associated with IDs with five properties: 1. Two replicas of the same atom have the same ID. 2. No two atoms have the same ID. 3. IDs are constant for the lifetime of an ordered set. 4. IDs are totally ordered. 5. The space of IDs is dense. That is, for all IDs P and F where P < F, there exists an ID N such that P < N < F.
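Property 5 is what makes coordination-free inserts possible: between any two IDs there is always room for another. A minimal sketch of density using rationals instead of Treedoc's tree paths:

```python
from fractions import Fraction

def id_between(p, f):
    """Return an ID strictly between p and f; rationals are dense, so this
    always succeeds (Treedoc uses paths in a binary tree instead)."""
    assert p < f
    return (p + f) / 2

left, right = Fraction(0), Fraction(1)
mid = id_between(left, right)        # 1/2
earlier = id_between(left, mid)      # 1/4: room for yet another insert
print(mid, earlier)
```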


The ordered set supports two operations:

  1. insert(ID, atom)
  2. delete(ID)

where atoms are ordered by their corresponding ID. Concretely, Treedoc is represented as a tree, and IDs are paths in the tree ordered by an infix traversal. Nodes in the tree can be major nodes and contain multiple mini-nodes where each mini-node is annotated with a totally ordered disambiguator unique to each node.


Deletes simply tombstone a node. Inserts work like a typical binary tree insertion. To keep the tree and IDs from growing too large, the tree can periodically be flattened: the tree is restructured into an array of nodes. This operation does not commute nicely with others, so a coordination protocol like 2PC is required.


Treedoc in the Large Scale. Requiring a coordination protocol for flattening doesn't scale and runs against the spirit of CRDTs. It also doesn't handle churn well. Instead, nodes can be partitioned into a set of core nodes and a set of nodes in the nebula. The core nodes coordinate while the nebula nodes lag behind. There are protocols for nebula nodes to catch up to the core nodes.

-------------------------------------------------------------------------------- /html/li2014automating.html: --------------------------------------------------------------------------------

Automating the Choice of Consistency Levels in Replicated Systems (2014)


Geo-replicated data stores have to choose between the performance overheads of implementing strong consistency and the mind-boggling semantics of weak consistency. Some data stores let users make this decision at a fine-grained level, allowing some operations to run with strong consistency while other operations run under weak consistency. For example, Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary introduced RedBlue consistency, in which users annotate operations as red (strong) or blue (weak). Typically, choosing the consistency level for each operation has been a manual task. This paper builds off of RedBlue consistency and automates the process of labeling operations as red or blue.


Overview. When using RedBlue consistency, users can decompose operations into generator operations and shadow operations to improve commutativity. This paper automates this process by converting Java applications that write state to a relational database to issue operations against CRDTs. Moreover, it uses a combination of static analysis and runtime checks to ensure that operations are invariant confluent.


Generating Shadow Operations. Relations are just sets of tuples, so we can model them as CRDT sets. Moreover, we can model individual fields as CRDTs. This paper presents field CRDTs (PN-Counters, LWW-registers) and set CRDTs to model relations. Users then annotate SQL create statements indicating which CRDT to use.
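As an example of a field CRDT, here is a minimal state-based PN-Counter sketch: each replica tracks its own increments and decrements, and merge takes a pointwise max, so merges commute and converge.

```python
class PNCounter:
    def __init__(self, replica_id, n_replicas):
        self.rid = replica_id
        self.p = [0] * n_replicas   # increments, per replica
        self.n = [0] * n_replicas   # decrements, per replica

    def increment(self, amount=1):
        self.p[self.rid] += amount

    def decrement(self, amount=1):
        self.n[self.rid] += amount

    def value(self):
        return sum(self.p) - sum(self.n)

    def merge(self, other):
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]

a, b = PNCounter(0, 2), PNCounter(1, 2)
a.increment(5)
b.decrement(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 3
```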


Using a custom JDBC driver, user operations can be converted into a corresponding shadow operation that issues CRDT operations.


Classification of Shadow Operations. We want to find the database states and transaction arguments that guarantee invariant confluence. This can be a very difficult (probably undecidable) problem in general. Instead, we simplify the problem by considering the possible traces of CRDT operations through a program. Even with this simplifying assumption, tracing execution through loops can be challenging. We can again simplify things by only considering loops whose iterations are independent of one another.


We model each program as a regular expression over statements. We then unroll the regular expression to get the set of all execution traces. Using these traces we can construct a map from traces, or templates, to weakest preconditions that ensure invariant confluence.


At runtime, we need only look up the weakest precondition given an execution trace and check that it is true.


Implementation. The system is implemented as 15K lines of Java and 533 lines of OCaml. It builds off of MySQL, an existing Java parser, and Gemini.

-------------------------------------------------------------------------------- /html/mckusick1984fast.html: --------------------------------------------------------------------------------

A Fast File System for UNIX (1984)


The Fast File System (FFS) improved the read and write throughput of the original Unix file system by 10x by

  1. increasing the block size,
  2. dividing blocks into fragments, and
  3. performing smarter allocation.

The original Unix file system, dubbed "the old file system", divided disk drives into partitions and loaded a file system onto each partition. The file system included a superblock containing metadata, a linked list of free data blocks known as the free list, and an inode for every file. Notably, the file system was composed of 512-byte blocks; no more than 512 bytes could be transferred from the disk at once. Moreover, the file system had poor data locality. Files were often sprayed across the disk, requiring lots of random disk accesses.


The "new file system" improved performance by increasing the block size to any power of two at least as big as 4096 bytes. In order to handle small files efficiently and avoid high internal fragmentation and wasted space, blocks were further divided into fragments at least as large as the disk sector size.

      +------------+------------+------------+------------+
block | fragment 1 | fragment 2 | fragment 3 | fragment 4 |
      +------------+------------+------------+------------+

Files would occupy as many complete blocks as possible before populating at most one fragmented block.
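A small sketch of the resulting space calculation, assuming (for illustration) a 4096-byte block and 1024-byte fragments:

```python
BLOCK = 4096
FRAGMENT = 1024

def allocation(file_size):
    """Return (full blocks, trailing fragments) for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, BLOCK)
    fragments = -(-remainder // FRAGMENT)   # ceiling division
    return full_blocks, fragments

print(allocation(5000))   # (1, 1): one full block plus one 1024-byte fragment
print(allocation(500))    # (0, 1): a small file wastes at most part of a fragment
```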


Data was also divided into cylinder groups where each cylinder group included a copy of the superblock, a list of inodes, a bitmap of available blocks (as opposed to a free list), some usage statistics, and finally data blocks. The file system took advantage of hardware-specific information to place data at rotational offsets specific to the hardware so that files could be read with as little delay as possible. Care was also taken to allocate files contiguously, to place similar files in the same cylinder group, and to keep all the inodes in a directory together. Moreover, if the amount of available space gets too low, it becomes more and more difficult to allocate blocks efficiently; for example, it becomes hard to allocate the blocks of a file contiguously. Thus, the system always tries to keep ~10% of the disk free.


Allocation is also improved in the FFS. A top-level global policy uses file-system-wide information to decide where to put new files. Then, a local policy places the blocks. Care must be taken to colocate blocks that are accessed together, but crowding a single cylinder group can exhaust its resources.


In addition to performance improvements, FFS also introduced

  1. longer filenames,
  2. advisory file locks,
  3. soft links,
  4. atomic file renaming, and
  5. disk quota enforcement.

-------------------------------------------------------------------------------- /html/moore2006inferring.html: --------------------------------------------------------------------------------

Inferring Internet Denial-of-Service Activity (2001)


Summary. This paper uses backscatter analysis to quantitatively analyze denial-of-service attacks on the Internet. Most flooding denial-of-service attacks involve IP spoofing, where each packet in an attack is given a faux source IP address drawn uniformly at random from the space of all IP addresses. If each packet elicits a reply from the victim, then victims of denial-of-service attacks end up sending unsolicited packets to hosts spread uniformly at random across the address space. By monitoring this backscatter at enough hosts, one can infer the number, intensity, and type of denial-of-service attacks.


There are, of course, a number of assumptions upon which backscatter analysis depends.

  1. Address uniformity. It is assumed that DoS attackers spoof IP addresses uniformly at random.
  2. Reliable delivery. It is assumed that packets, as part of the attack and response, are delivered reliably.
  3. Backscatter hypothesis. It is assumed that unsolicited packets arriving at a host are actually part of backscatter.

The paper performs a backscatter analysis on 1/256 of the IPv4 address space. They cluster the backscatter data using a flow-based classification to measure individual attacks and using an event-based classification to measure the intensity of attacks. The findings of the analysis are best summarized by the paper.
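Under the address-uniformity assumption, a monitor covering n of the 2^32 IPv4 addresses sees roughly an n/2^32 fraction of a victim's backscatter, so the observed rate can simply be scaled up. A quick sketch of that extrapolation (the observed numbers are made up):

```python
MONITORED = 2 ** 24   # a /8 network: 1/256 of the IPv4 address space

def estimated_attack_rate(observed_packets, seconds):
    """Scale the backscatter rate seen at the monitor up to the full address
    space (assumes uniform spoofing and reliable delivery)."""
    observed_rate = observed_packets / seconds
    return observed_rate * (2 ** 32 / MONITORED)

# e.g. 600 backscatter packets in one minute suggests the victim is sending
# roughly 2,560 response packets per second.
print(estimated_attack_rate(600, 60))   # 2560.0
```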

-------------------------------------------------------------------------------- /html/ritchie1978unix.html: --------------------------------------------------------------------------------

The Unix Time-Sharing System (1974)


Unix was an operating system developed by Dennis Ritchie, Ken Thompson, and others at Bell Labs. It was the successor to Multics and is probably the single most influential piece of software ever written.


Earlier versions of Unix were written in assembly, but the project was later ported to C: probably the single most influential programming language ever developed. This resulted in a 1/3 increase in size, but the code was much more readable and the system included new features, so it was deemed worth it.


The most important feature of Unix was its file system. Ordinary files were simple arrays of bytes physically stored as 512-byte blocks: a rather simple design. Each file was given an inumber: an index into an ilist of inodes. Each inode contained metadata about the file and pointers to the actual data of the file in the form of direct and indirect blocks. This representation made it easy to support (hard) linking. Each file was protected with 9 bits: the same protection model Linux uses today. Directories were themselves files which stored mappings from filenames to inumbers. Devices were modeled simply as files in the /dev directory. This unifying abstraction allowed devices to be accessed with the same API. File systems could be mounted using the mount command. Notably, Unix didn't support user level locking, as it was neither necessary nor sufficient.
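A toy sketch of the directory abstraction: directories are just tables from names to inumbers, so path lookup is repeated table lookup (the data below is invented for illustration).

```python
# inumber -> inode; a directory's data is a name -> inumber table.
ilist = {
    0: {"type": "dir",  "data": {"dev": 1, "usr": 2}},   # root directory
    1: {"type": "dir",  "data": {"tty": 3}},
    2: {"type": "dir",  "data": {"readme": 4}},
    3: {"type": "dev"},
    4: {"type": "file", "data": b"hello"},
}

def namei(path, root=0):
    """Resolve a path like /usr/readme to an inumber."""
    inumber = root
    for name in path.strip("/").split("/"):
        inumber = ilist[inumber]["data"][name]
    return inumber

print(namei("/usr/readme"))   # 4
print(namei("/dev/tty"))      # 3: devices resolve exactly like ordinary files
```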


Processes in Unix could be created using a fork followed by an exec, and processes could communicate with one another using pipes. The shell was nothing more than an ordinary process. Unix included file redirection, pipes, and the ability to run programs in the background. All this was implemented using fork, exec, wait, and pipes.
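As a rough illustration of how a shell wires this together, here is a minimal Python sketch (Unix-only, and obviously not the original C) that runs the equivalent of ls | wc -l using fork, exec, pipe, and wait.

```python
import os

read_fd, write_fd = os.pipe()

if os.fork() == 0:            # first child: ls
    os.dup2(write_fd, 1)      # stdout -> pipe
    os.close(read_fd)
    os.close(write_fd)
    os.execvp("ls", ["ls"])

if os.fork() == 0:            # second child: wc -l
    os.dup2(read_fd, 0)       # stdin <- pipe
    os.close(read_fd)
    os.close(write_fd)
    os.execvp("wc", ["wc", "-l"])

os.close(read_fd)             # the parent (the "shell") closes its copies
os.close(write_fd)
os.wait()
os.wait()
```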


Unix also supported signals.

-------------------------------------------------------------------------------- /html/saltzer1984end.html: --------------------------------------------------------------------------------

End-to-End Arguments in System Design (1984)


This paper presents the end-to-end argument:


The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)


which says that in a layered system, functionality should, nay must, be implemented as close to the application as possible to ensure correctness (and usually also performance).


The end-to-end argument is motivated by an example file transfer scenario in which host A transfers a file to host B. Every step of the file transfer presents an opportunity for failure. For example, the disk may silently corrupt data or the network may reorder or drop packets. Any attempt by one of these subsystems to ensure reliable delivery is wasted effort since the delivery may still fail in another subsystem. The only way to guarantee correctness is to have the file transfer application check for correct delivery itself. For example, once it receives the entire file, it can send the file's checksum back to host A to confirm correct delivery.
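A small sketch of the end-to-end check itself: hash what was received and compare it against a hash of what was sent (SHA-256 and the byte strings are arbitrary choices for illustration).

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

sent = b"the entire file, as read from host A's disk"
received = b"the entire file, as read from host A's disk"  # what host B stored

# Host B sends checksum(received) back to host A; A retries on mismatch.
if checksum(received) == checksum(sent):
    print("delivery confirmed end to end")
else:
    print("corruption somewhere along the way; retry the transfer")
```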


In addition to being necessary for correctness, applying the end-to-end argument usually also leads to improved performance. When functionality is implemented in a lower-level subsystem, every application built on it must pay its cost, even if it does not require the functionality.


There are numerous other examples of the end-to-end argument:

  • Guaranteed packet delivery.
  • Secure data transmission.
  • Duplicate message suppression.
  • FIFO delivery.
  • Transaction management.
  • RISC.

The end-to-end argument is not a hard and fast rule. In particular, it may be eschewed when implementing a functionality at a lower level can lead to performance improvements. Consider again the file transfer protocol above and assume the network drops one in every 100 packets. As the file becomes longer, the odds of a successful delivery become increasingly small, making it prohibitively expensive for the application alone to ensure reliable delivery. The network may be able to perform a small amount of work to help guarantee reliable delivery, making the file transfer more efficient.
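A quick back-of-the-envelope check of that claim: with independent 1% drops and no lower-level retransmission, the chance that every packet of a file arrives on the first try shrinks geometrically with file length.

```python
def p_all_delivered(num_packets, drop_rate=0.01):
    """Probability that every packet arrives, assuming independent drops."""
    return (1 - drop_rate) ** num_packets

for n in (10, 100, 1000, 10000):
    print(n, p_all_delivered(n))
# 10 packets: ~0.90, 100: ~0.37, 1000: ~4e-5, 10000: effectively zero
```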

-------------------------------------------------------------------------------- /html/vavilapalli2013apache.html: --------------------------------------------------------------------------------

Apache Hadoop YARN: Yet Another Resource Negotiator (2013)


Summary. Hadoop began as a MapReduce clone designed for large-scale web crawling. As big data became trendy and data became... big, Hadoop became the de facto standard data processing system, and large Hadoop clusters were installed in many companies as "the" cluster. As application requirements evolved, users started abusing these large Hadoop clusters in unintended ways. For example, users would submit map-only jobs which were thinly disguised web servers. Apache Hadoop YARN is a cluster manager that aims to disentangle cluster management from the programming paradigm and has the following goals:

  • Scalability
  • Multi-tenancy
  • Serviceability
  • Locality awareness
  • High cluster utilization
  • Reliability/availability
  • Secure and auditable operation
  • Support for programming model diversity
  • Flexible resource model
  • Backward compatibility

YARN is orchestrated by a per-cluster Resource Manager (RM) that tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants. An Application Master (AM) is responsible for negotiating resources with the RM and for managing the execution of a single job. AMs send ResourceRequests to the RM describing their resource requirements, locality preferences, etc. In return, the RM hands out containers (e.g. <2GB RAM, 1 CPU>) to AMs. The RM also communicates with Node Managers (NM) running on each node, which are responsible for measuring node resources and managing (i.e. starting and killing) tasks. When a user wants to submit a job, they send it to the RM, which hands a capability to an AM to present to an NM. The RM is a single point of failure. If it fails, it restores its state from disk and kills all running AMs. The AMs are trusted to be fault-tolerant and resubmit any prematurely terminated jobs.


YARN is deployed at Yahoo where it manages roughly 500,000 daily jobs. YARN supports frameworks like Hadoop, Tez, Spark, Dryad, Giraph, and Storm.

-------------------------------------------------------------------------------- /html/vogels2009eventually.html: --------------------------------------------------------------------------------

Eventually Consistent (2009)


In this CACM article, Werner Vogels discusses eventual consistency as well as a couple other forms of consistency.


Historical Perspective


Back in the day, when people were thinking about how to build distributed systems, they thought that strong consistency was the only option. Anything else would just be incorrect, right? Well, fast forward to the 90's and availability started being favored over consistency. In 2000, Brewer released the CAP theorem unto the world, and weak consistency really took off.


Client's Perspective of Consistency


There is a zoo of consistency models from the perspective of a client.

  • Strong consistency. A data store is strongly consistent if after a write completes, it is visible to all subsequent reads.
  • Weak consistency. Weak consistency is a synonym for no consistency. Your reads can return garbage.
  • Eventual consistency. "the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value."
  • Causal consistency. If A issues a write and then contacts B, B's read will see the effect of A's write since it causally comes after it. Causally unrelated reads can return whatever they want.
  • Read-your-writes consistency. Clients read their most recent write or a more recent version.
  • Session consistency. So long as a client maintains a session, it gets read-your-writes consistency.
  • Monotonic read consistency. Over time, a client will see increasingly fresh versions of data.
  • Monotonic write consistency. A client's writes will be executed in the order in which they are issued.

Server's Perspective of Consistency


Systems can implement consistency using quorums in which a write is sent to W of the N replicas, and a read is sent to R of the N replicas. If R + W > N, then we have strong consistency.
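A tiny sketch of the quorum arithmetic: a read quorum of size R and a write quorum of size W drawn from N replicas are guaranteed to overlap exactly when R + W > N, which is what forces reads to observe the latest write.

```python
def quorums_overlap(n, r, w):
    """True iff every read quorum must intersect every write quorum."""
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    kind = "strong" if quorums_overlap(n, r, w) else "weaker (may read stale data)"
    print(f"N={n} R={r} W={w}: {kind}")
```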

-------------------------------------------------------------------------------- /html/zaharia2012resilient.html: --------------------------------------------------------------------------------

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012)


Frameworks like MapReduce made processing large amounts of data easier, but they did not leverage distributed memory. If a MapReduce job was run iteratively, it would write all of its intermediate state to disk, which was prohibitively slow. This limitation made batch processing systems like MapReduce ill-suited to iterative (e.g. k-means clustering) and interactive (e.g. ad-hoc queries) workflows. Other systems like Pregel did take advantage of distributed memory and reused in-memory data across computations, but those systems were not general-purpose.


Spark uses Resilient Distributed Datasets (RDDs) to perform general computations in memory. RDDs are immutable partitioned collections of records. Unlike pure distributed shared memory abstractions which allow for arbitrary fine-grained writes, RDDs can only be constructed using coarse-grained transformations from on-disk data or other RDDs. This weaker abstraction can be implemented efficiently. Spark also uses RDD lineage to implement low-overhead fault tolerance. Rather than persist intermediate datasets, the lineage of an RDD can be persisted and efficiently recomputed. RDDs could also be checkpointed to avoid the recomputation of a long lineage graph.
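A toy sketch (nothing like Spark's real API) of the lineage idea: each dataset remembers only its parent and the coarse-grained transformation that produced it, so a lost partition can be rebuilt by recomputation rather than restored from a persisted copy.

```python
class ToyRDD:
    """A dataset defined by its lineage: a parent plus a transformation."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source, self.parent, self.transform = source, parent, transform

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        """Recompute from lineage; if data is lost, just call this again."""
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.compute())

base = ToyRDD(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.compute())   # [0, 4, 16, 36, 64]
```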


Spark has a Scala-integrated API and comes with a modified interactive interpreter. It also includes a large number of useful transformations (which construct RDDs) and actions (which derive data from RDDs). Users can also manually specify RDD persistence and partitioning to further improve performance.


Spark subsumed a huge number of existing data processing frameworks like MapReduce and Pregel in a small amount of code. It was also much, much faster than everything else on a large number of applications.

20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /images/dan2017using_fork.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mwhittaker/papers/b37f7047b911a1abe9a52b85654d0ff8bcc10c0f/images/dan2017using_fork.png -------------------------------------------------------------------------------- /images/dan2017using_zigzag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mwhittaker/papers/b37f7047b911a1abe9a52b85654d0ff8bcc10c0f/images/dan2017using_zigzag.png -------------------------------------------------------------------------------- /js/mathjax_config.js: -------------------------------------------------------------------------------- 1 | window.MathJax = { 2 | tex2jax: { 3 | inlineMath: [['$','$'], ['\\(','\\)']], 4 | skipTags: ['script', 'noscript', 'style', 'textarea'], 5 | }, 6 | TeX: { 7 | Macros: { 8 | set: ["\\left\\{#1\\right\\}", 1], 9 | setst: ["\\left\\{#1 \\,\\middle|\\, #2\\right\\}", 2], 10 | parens: ["\\left(#1\\right)", 1], 11 | reals: "\\mathbb{R}", 12 | nats: "\\mathbb{N}", 13 | rats: "\\mathbb{Q}", 14 | ints: "\\mathbb{Z}", 15 | defeq: "\\triangleq", 16 | coleq: "\\texttt{:-}", 17 | norm: ["\\left\\lVert#1\\right\\rVert", 1], 18 | twonorm: ["\\norm{#1}_2", 1], 19 | }, 20 | }, 21 | }; 22 | -------------------------------------------------------------------------------- /papers/adya2000generalized.md: -------------------------------------------------------------------------------- 1 | ## [Generalized Isolation Level Definitions (2000)](TODO) ## 2 | **Summary.** 3 | In addition to serializability, ANSI SQL-92 defined a set of weaker isolation 4 | levels that applications could use to improve performance at the cost of 5 | consistency. The definitions were implementation-independent but ambiguous. 6 | Berenson et al. proposed a revision of the isolation level definitions that was 7 | unambiguous but specific to locking. Specifically, they define a set of 8 | *phenomena*: 9 | 10 | - P0: `w1(x) ... w2(x) ...` *"dirty write"* 11 | - P1: `w1(x) ... r2(x) ...` *"dirty read"* 12 | - P2: `r1(x) ... w2(x) ...` *"unrepeatable read"* 13 | - P3: `r1(P) ... w2(y in P) ...` *"phantom read"* 14 | 15 | and define the isolation levels according to which phenomena they preclude. 16 | This preclusion can be implemented by varying how long certain types of locks 17 | are held: 18 | 19 | | write locks | read locks | phantom locks | precluded | 20 | | ----------- | ---------- | ------------- | -------------- | 21 | | short | short | short | P0 | 22 | | long | short | short | P0, P1 | 23 | | long | long | short | P0, P1, P2 | 24 | | long | long | long | P0, P1, P2, P3 | 25 | 26 | This locking-specific *preventative* approach to defining isolation levels, 27 | while unambiguous, rules out many non-locking implementations of concurrency 28 | control. Notably, it does not allow for multiversioning and does not allow 29 | non-committed transactions to experience weaker consistency than committed 30 | transactions. Moreover, many isolation levels are naturally expressed as 31 | invariants between multiple objects, but these definitions are all over a 32 | single object. 33 | 34 | This paper introduces implementation-independent unambiguous isolation level 35 | definitions. The definitions also include notions of predicates at all levels. 
36 | It does so by first introducing the definition of a *history* as a partial 37 | order of read/write/commit/abort events and total order of commited object 38 | versions. It then introduces three dependencies: *read-dependencies*, 39 | *anti-dependencies*, and *write-dependencies* (also known as write-read, 40 | read-write, and write-write dependencies). Next, it describes how to construct 41 | dependency graph and defines isolation levels as constraints on these graphs. 42 | 43 | For example, the G0 phenomenon says that a dependency graph contains a 44 | write-dependency cycle. PL-1 is the isolation level that precludes G0. 45 | Similarly, the G1 phenomenon says that either 46 | 47 | 1. a committed transaction reads an aborted value, 48 | 2. a committed transaction reads an intermediate value, or 49 | 3. there is a write-read/write-write cycle. 50 | 51 | The PL-2 isolation level precludes G1 (and therefore G0) and corresponds 52 | roughly to the READ-COMMITTED isolation level. 53 | -------------------------------------------------------------------------------- /papers/agrawal1987concurrency.md: -------------------------------------------------------------------------------- 1 | # [Concurrency Control Performance Modeling: Alternatives and Implications (1987)](https://scholar.google.com/scholar?cluster=9784855600346107276&hl=en&as_sdt=0,5) 2 | ## Overview 3 | There are three types of concurrency control algorithms: locking algorithms, 4 | timestamp based algorithms, optimistic algorithms. There have been a large 5 | number of performance analyses aimed at deciding which type of concurrency 6 | algorithm is best, however despite the abundance of analyses, there is no 7 | definitive winner. Different analyses have contradictory results, largely 8 | because there is no standard performance model or set of assumptions. This 9 | paper presents a complete database model for evaluating the performance of 10 | concurrency control algorithms and discusses how varying assumptions affect the 11 | performance of various algorithms. 12 | 13 | ## Performance Model 14 | This paper analyzes three specific concurrency control mechanisms, 15 | 16 | - **Blocking.** Transactions acquire locks before they access a data item. 17 | Whenever a transaction acquires a lock, deadlock detection is run. If a 18 | deadlock is detected, the youngest transaction is aborted. 19 | - **Immediate-restart.** Again, transactions acquire locks, but instead of 20 | blocking if a lock cannot be immediately acquired, the transaction is instead 21 | aborted and restarted with delay. This delay is adjusted dynamically to be 22 | roughly equal to the average transaction duration. 23 | - **Optimistic.** Transactions do not acquire locks. A transaction is aborted 24 | when it goes to commit if it read any objects that had been written and 25 | committed since the transaction began. 26 | 27 | using a closed queueing model of a single-site database. Essentially, 28 | transactions come in, sit in some queues, and are controlled by a concurrency 29 | control algorithm. The model has a number of parameters, some of which are held 30 | constant for all the experiments and some of which are varied from experiment 31 | to experiment. Some of the parameters had to be tuned to get interesting 32 | result. For example, it was found that with a large database and few conflicts, 33 | all concurrency control algorithms performed roughly the same and scaled with 34 | the degree of parallelism. 
35 | 36 | ## Resource-Related Assumptions 37 | Some analyses assume infinite resources. How do these assumptions affect 38 | concurrency control performance? 39 | 40 | - **Experiment 1: Infinite Resources.** Given infinite resources, higher 41 | degrees of parallelism lead to higher likelihoods of transaction conflict 42 | which in turn leads to higher likelihoods of transaction abort and restart. 43 | The blocking algorithm thrashes because of these increased conflicts. The 44 | immediate-restart algorithm plateaus because the dynamic delay effectively 45 | limits the amount of parallelism. The optimistic algorithm does well because 46 | aborted transactions are immediately replaced with other transactions. 47 | - **Experiment 2: Limited Resources.** With a limited number of resources, all 48 | three algorithms thrash, but blocking performs best. 49 | - **Experiment 3: Multiple Resources.** The blocking algorithm performs best up 50 | to about 25 resource units (i.e. 25 CPUs and 50 disks); after that, the 51 | optimistic algorithm performs best. 52 | - **Experiment 4: Interactive Workloads.** When transactions spend more time 53 | "thinking", the system begins to behave more like it has infinite resources 54 | and the optimistic algorithm performs best. 55 | 56 | ## Transaction Behavior Assumptions 57 | 58 | - **Experiment 6: Modeling Restarts.** Some analyses model a transaction 59 | restart as the spawning of a completely new transaction. These fake restarts 60 | lead to higher throughput because they avoid repeated transaction conflict. 61 | - **Experiment 7: Write-Lock Acquisition.** Some analyses have transactions 62 | acquire read-locks and then upgrade them to write-locks. Others have 63 | transactions immediately acquire write-locks if the object will ever be 64 | written to. Upgrading locks can lead to deadlock if two transactions 65 | concurrently write to the same object. The effect of lock upgrading varies 66 | with the amount of available resources. 67 | -------------------------------------------------------------------------------- /papers/alvaro2010boom.md: -------------------------------------------------------------------------------- 1 | ## [BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud (2010)](TODO) ## 2 | **Summary.** 3 | Programming distributed systems is hard. Really hard. This paper conjectures 4 | that *data-centric* programming done with *declarative programming languages* 5 | can lead to simpler distributed systems that are more correct with less code. 6 | To support this conjecture, the authors implement an HDFS and Hadoop clone in 7 | Overlog, dubbed BOOM-FS and BOOM-MR respectively, using orders of magnitude 8 | fewer lines of code that the original implementations. They also extend BOOM-FS 9 | with increased availability, scalability, and monitoring. 10 | 11 | An HDFS cluster consists of NameNodes responsible for metadata management and 12 | DataNodes responsible for data management. BOOM-FS reimplements the metadata 13 | protocol in Overlog; the data protocol is implemented in Java. The 14 | implementation models the entire system state (e.g. files, file paths, 15 | heartbeats, etc.) as data in a unified way by storing them as collections. The 16 | Overlog implementation of the NameNode then operates on and updates these 17 | collections. Some of the data (e.g. file paths) are actually views that can be 18 | optionally materialized and incrementally maintained. 
After reaching (almost) 19 | feature parity with HDFS, the authors increased the availability of the 20 | NameNode by using Paxos to introduce a hot standby replicas. They also 21 | partition the NameNode metadata to increase scalability and use metaprogramming 22 | to implement monitoring. 23 | 24 | BOOM-MR plugs in to the existing Hadoop code but reimplements two MapReduce 25 | scheduling algorithms: Hadoop's first-come first-server algorithm, and 26 | Zaharia's LATE policy. 27 | 28 | BOOM Analytics was implemented in order of magnitude fewer lines of code thanks 29 | to the data-centric approach and the declarative programming language. The 30 | implementation is also almost as fast as the systems they copy. 31 | -------------------------------------------------------------------------------- /papers/alvaro2011consistency.md: -------------------------------------------------------------------------------- 1 | ## [Consistency Analysis in Bloom: a CALM and Collected Approach (2011)](https://scholar.google.com/scholar?cluster=9165311711752272482&hl=en&as_sdt=0,5) ## 2 | **Summary.** 3 | Strong consistency eases reasoning about distributed systems, but it requires 4 | coordination which entails higher latency and unavailability. Adopting weaker 5 | consistency models can improve system performance but writing applications 6 | against these weaker guarantees can be nuanced, and programmers don't have any 7 | reasoning tools at their disposal. This paper introduces the *CALM conjecture* 8 | and *Bloom*, a disorderly declarative programming language based on CALM, which 9 | allows users to write loosely consistent systems in a more principled way. 10 | 11 | Say we've eschewed strong consistency, how do we know our program is even 12 | eventually consistent? For example, consider a distributed register in which 13 | servers accept writes on a first come first serve basis. Two clients could 14 | simultaneously write two distinct values `x` and `y` concurrently. One server 15 | may receive `x` then `y`; the other `y` then `x`. This system is not eventually 16 | consistent. Even after client requests have quiesced, the distributed register 17 | is in an inconsistent state. The *CALM conjecture* embraces Consistency As 18 | Logical Monotonicity and says that logically monotonic programs are eventually 19 | consistent for any ordering and interleaving of message delivery and 20 | computation. Moreover, they do not require waiting or coordination to stream 21 | results to clients. 22 | 23 | *Bud* (Bloom under development) is the realization of the CALM theorem. It is 24 | Bloom implemented as a Ruby DSL. Users declare a set of Bloom collections (e.g. 25 | persistent tables, temporary tables, channels) and an order independent set of 26 | declarative statements similar to Datalog. Viewing Bloom through the lens of 27 | its procedural semantics, Bloom execution proceeds in timesteps, and each 28 | timestep is divided into three phases. First, external messages are delivered 29 | to a node. Second, the Bloom statements are evaluated. Third, messages are sent 30 | off elsewhere. Bloom also supports modules and interfaces to improve 31 | modularity. 32 | 33 | The paper also implements a key-value store and shopping cart using Bloom and 34 | uses various visualization tools to guide the design of coordination-free 35 | implementations. 
36 | -------------------------------------------------------------------------------- /papers/armbrust2015spark.md: -------------------------------------------------------------------------------- 1 | ## [Spark SQL: Relational Data Processing in Spark (2015)](https://scholar.google.com/scholar?cluster=12543149035101013955&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Data processing frameworks like MapReduce and Spark can do things that 4 | relational databases can't do very easily. For example, they can operate over 5 | semi-structured or unstructured data, and they can perform advanced analytics. 6 | On the other hand, Spark's API allows user to run arbitrary code (e.g. 7 | `rdd.map(some_arbitrary_function)`) which prevents Spark from performing 8 | certain optimizations. Spark SQL marries imperative Spark-like data processing 9 | with declarative SQL-like data processing into a single unified interface. 10 | 11 | Spark's main abstraction was an RDD. Spark SQL's main abstraction is a 12 | *DataFrame*: the Spark analog of a table which supports a nested data model of 13 | standard SQL types as well as structs, arrays, maps, unions, and user defined 14 | types. DataFrames can be manipulated as if they were RDDs of row objects (e.g. 15 | `dataframe.map(row_func)`), but they also support a set of standard relational 16 | operators which take ASTs, built using a DSL, as arguments. For example, the 17 | code `users.where(users("age") < 40)` constructs an AST from `users("age") < 18 | 40` as an argument to filter the `users` DataFrame. By passing in ASTs as 19 | arguments rather than arbitrary user code, Spark is able to perform 20 | optimizations it previously could not do. DataFrames can also be queries using 21 | SQL. 22 | 23 | Notably, integrating queries into an existing programming language (e.g. Scala) 24 | makes writing queries much easier. Intermediate subqueries can be reused, 25 | queries can be constructed using standard control flow, etc. Moreover, Spark 26 | eagerly typechecks queries even though their execution is lazy. Furthermore, 27 | Spark SQL allows users to create DataFrames of language objects (e.g. Scala 28 | objects), and UDFs are just normal Scala functions. 29 | 30 | DataFrame queries are optimized and manipulated by a new extensible query 31 | optimizer called *Catalyst*. The query optimizer manipulates ASTs written in 32 | Scala using *rules*, which are just functions from trees to trees that 33 | typically use pattern matching. Queries are optimized in four phases: 34 | 35 | 1. *Analysis.* First, relations and columns are resolved, queries are 36 | typechecked, etc. 37 | 2. *Logical optimization.* Typical logical optimizations like constant folding, 38 | filter pushdown, boolean expression simplification, etc are performed. 39 | 3. *Physical planning.* Cost based optimization is performed. 40 | 4. *Code generation.* Scala quasiquoting is used for code generation. 41 | 42 | Catalyst also makes it easy for people to add new data sources and user defined 43 | types. 44 | 45 | Spark SQL also supports schema inference, ML integration, and query federation: 46 | useful features for big data. 
47 | 48 | -------------------------------------------------------------------------------- /papers/bailis2013highly.md: -------------------------------------------------------------------------------- 1 | ## [Highly Available Transactions: Virtues and Limitations (2014)](TODO) ## 2 | **Summary.** 3 | Serializability is the gold standard of consistency, but databases have always 4 | provided weaker consistency modes (e.g. Read Committed, Repeatable Read) that 5 | promise improved performance. In this paper, Bailis et al. determine which of 6 | these weaker consistency models can be implemented with high availability. 7 | 8 | First, why is high availability important? 9 | 10 | 1. *Partitions.* Partitions happen, and when they do non-available systems 11 | become, well, unavailable. 12 | 2. *Latency.* Partitions may be transient, but latency is forever. Highly 13 | available systems can avoid latency by eschewing coordination costs. 14 | 15 | Second, are weaker consistency models consistent enough? In short, yeah 16 | probably. In a survey of databases, Bailis finds that many do not employ 17 | serializability by default and some do not even provide full serializability. 18 | Bailis also finds that four of the five transactions in the TPC-C benchmark can 19 | be implemented with highly available transactions. 20 | 21 | After defining availability, Bailis presents the taxonomy of which consistency 22 | can be implemented as HATs, and also argues why some fundamentally cannot. He 23 | also performs benchmarks on AWS to show the performance benefits of HAT. 24 | -------------------------------------------------------------------------------- /papers/bailis2014coordination.md: -------------------------------------------------------------------------------- 1 | ## [Coordination Avoidance in Database Systems (2014)](https://scholar.google.com/scholar?cluster=428435405994413003&hl=en&as_sdt=0,5) 2 | **Overivew.** 3 | Coordination in a distributed system is sometimes necessary to maintain 4 | application correctness, or *consistency*. For example, a payroll application 5 | may require that each employee has a unique ID, or that a jobs relation only 6 | include valid employees. However, coordination is not cheap. It increases 7 | latency, and in the face of partitions can lead to unavailability. Thus, when 8 | application correctness permits, coordination should be avoided. This paper 9 | develops the necessary and sufficient conditions for when coordination is 10 | needed to maintain a set of database invariance using a notion of 11 | invariant-confluence or I-confluence. 12 | 13 | **System Model.** 14 | A *database state* is a set D of object versions drawn from the set of all 15 | states *D*. Transactions operate on *logical replicas* in *D* that contain the 16 | set of object versions relevant to the transaction. Transactions are modeled as 17 | functions T : *D* -> *D*. The effects of a transaction are merged into an 18 | existing replica using an associative, commutative, idempotent merge operator. 19 | Changes are shared between replicas and merges likewise. In this paper, merge 20 | is set union, and we assume we know all transactions in advance. Invariants are 21 | modeled as boolean functions I: *D* -> 2. A state R is said to be I-valid if 22 | I(R) is true. 23 | 24 | We say a system has *transactional availability* if whenever a transaction T 25 | can contact servers with the appropriate data in T, it only aborts if T chooses 26 | to abort. 
We say a system is *convergent* if after updates quiesce, all servers 27 | eventually have the same state. A system is globally *I-valid* if all replicas 28 | always have I-valid states. A system provides coordination-free execution if 29 | execution of a given transaction does not depend on the execution of others. 30 | 31 | **Consistency Sans Coordination.** 32 | A state Si is *I-T-Reachable* if its derivable from I-valid states with 33 | transactions in T. A set of transactions is *I*-confluent with respect to 34 | invariant I if for all I-T-Reachable states Di, Dj with common ancestor Di join 35 | Dj is I-valid. A globally I-valid system can execute a set of transactions T 36 | with global validity, transactional availability, convergence, and 37 | coordination-freedom if and only if T is I-confluent with respect to I. 38 | 39 | **Applying Invariant-Confluence.** 40 | I-confluence can be applied to existing relation operators and constraints. For 41 | example, updates, inserts, and deletes are I-confluent with respect to 42 | per-record inequality constraint. Deletions are I-confluent with respect to 43 | foreign key constraints; additions and updates are not. I-confluence can also 44 | be applied to abstract data types like counters. 45 | 46 | -------------------------------------------------------------------------------- /papers/barham2003xen.md: -------------------------------------------------------------------------------- 1 | ## [Xen and the Art of Virtualization (2003)](https://scholar.google.com/scholar?cluster=11605682627859750448&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Many virtual machine monitors, or *hypervisors*, aim to run unmodified guest 4 | operating systems by presenting a completely virtual machine. This lets any OS 5 | run on the hypervisor but comes with a significant performance penalty. Xen is 6 | an x86 hypervisor that uses *paravirtualization* to reduce virtualization 7 | overheads. Unlike with full virtualization, paravirtualization only virtualizes 8 | some components of the underlying machine. This paravirtualization requires 9 | modifications to the guest operating systems but not the applications running 10 | on it. Essentially, Xen sacrifices the ability to run unmodified guest 11 | operating systems for improved performance. 12 | 13 | There are three components that need to be paravirtualized: 14 | 15 | - *Memory management.* Software page tables and tagged page tables are easier 16 | to virtualize. Unfortunately, x86 has neither. Xen paravirtualizes the 17 | hardware accessible page tables leaving guest operating systems responsible 18 | for managing them. Page table modifications are checked by Xen. 19 | - *CPU.* Xen takes advantage of x86's four privileges, called *rings*. Xen runs 20 | at ring 0 (the most privileged ring), the guest OS runs at ring 1, and the 21 | applications running in the guest operating systems run at ring 3. 22 | - *Device I/O.* Guest operating systems communicate with Xen via bounded 23 | circular producer/consumer buffers. Xen communicates to guest operating 24 | systems using asynchronous notifications. 25 | 26 | The Xen hypervisor implements mechanisms. Policy is delegated to a privileged 27 | domain called dom0 that has accessed to privileges that other domains don't. 28 | 29 | Finally, a look at some details about Xen: 30 | 31 | - *Control transfer.* Guest operating systems request services from the 32 | hypervisor via *hypercalls*. 
Hypercalls are like system calls except they are 33 | between a guest operating system and a hypervisor rather than between an 34 | application and an operating system. Furthermore, each guest OS registers 35 | interrupt handlers with Xen. When an event occurs, Xen toggles a bitmask to 36 | indicate the type of event before invoking the registered handler. 37 | - *Data transfer.* As mentioned earlier, data transfer is performed using a 38 | bounded circular buffer of I/O descriptors. Requests and responses are pushed 39 | on to the buffer. Requests can come out of order with respect to the 40 | requests. Moreover, requests and responses can be batched. 41 | - *CPU Scheduling.* Xen uses the BVT scheduling algorithm. 42 | - *Time and timers.* Xen supports real time (the time in nanoseconds from 43 | machine boot), virtual time (time that only increases when a domain is 44 | executing), and clock time (an offset added to the real time). 45 | - *Virtual address translation.* Other hypervisors present a virtual contiguous 46 | physical address space on top of the actual hardware address space. The 47 | hypervisor maintains a shadow page table mapping physical addresses to 48 | hardware addresses and installs real page tables into the MMU. This has high 49 | overhead. Xen takes an alternate approach. Guest operating systems issue 50 | hypercalls to manage page table entries that are directly inserted into the 51 | MMU's page table. After they are installed, the page table entries are 52 | read-only. 53 | - *Physical memory.* Memory is physically partitioned into reservations. 54 | - *Network.* Xen provides virtual firewall-routers with one or more virtual 55 | network interfaces, each with a circular ring buffer. 56 | - *Disk.* Xen presents virtual block devices each with a ring buffer. 57 | -------------------------------------------------------------------------------- /papers/baumann2015shielding.md: -------------------------------------------------------------------------------- 1 | ## [Shielding Applications from an Untrusted Cloud with Haven (2014)](https://scholar.google.com/scholar?cluster=12325554201123386346&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | When running an application in the cloud, users have to trust (i) the cloud 4 | provider's software, (ii) the cloud provider's staff, and (iii) law enforcement 5 | with the ability to access user data. Intel SGX partially solves this problem 6 | by allowing users to run small portions of program on remote servers with 7 | guarantees of confidentiality and integrity. Haven leverages SGX and Drawbridge 8 | to run *entire legacy programs* with shielded execution. 9 | 10 | Haven assumes a very strong adversary which has access to all the system's 11 | software and most of the system's hardware. Only the processor and SGX hardware 12 | is trusted. Haven provides confidentiality and integrity, but not availability. 13 | It also does not prevent side-channel attacks. 14 | 15 | There are two main challenges that Haven's design addresses. First, most 16 | programs are written assuming a benevolent host. This leads to Iago attacks in 17 | which the OS subverts the application by exploiting its assumptions about the 18 | OS. Haven must operate correctly despite a *malicious host*. To do so, Haven 19 | uses a library operation system LibOS that is part of a Windows sandboxing 20 | framework called Drawbridge. LibOS implements a full OS API using only a few 21 | core host OS primitives. These core host OS primitives are used in a defensive 22 | way. 
A shield module sits below LibOS and takes great care to ensure that LibOS 23 | is not susceptible to Iago attacks. The user's application, LibOS, and the 24 | shield module are all run in an SGX enclave. 25 | 26 | Second, Haven aims to run *unmodified* binaries which were not written with 27 | knowledge of SGX. Real world applications allocate memory, load and run code 28 | dynamically, etc. Many of these things are not supported by SGX, so Haven (a) 29 | emulated them and (b) got the SGX specification revised to address them. 30 | 31 | Haven also implements an in-enclave encrypted file system in which only the 32 | root and leaf pages need to be written to stable storage. As of publication, 33 | however, Haven did not fully implement this feature. Haven is susceptible to 34 | replay attacks. 35 | 36 | Haven was evaluated by running Microsoft SQL Server and Apache HTTP Server. 37 | 38 | -------------------------------------------------------------------------------- /papers/bershad1995spin.md: -------------------------------------------------------------------------------- 1 | ## [SPIN -- An Extensible Microkernel for Application-specific Operating System Services (1995)](https://scholar.google.com/scholar?cluster=4910839957971330989&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Many operating systems were built a long time ago, and their performance was 4 | tailored to the applications and workloads at the time. More recent 5 | applications, like databases and multimedia applications, are quite different 6 | than these applications and can perform quite poorly on existing operating 7 | systems. SPIN is an extensible microkernel that allows applications to tailor 8 | the operating system to meet their needs. 9 | 10 | Existing operating systems fit into one of three categories: 11 | 12 | 1. They have no interface by which applications can modify kernel behavior. 13 | 2. They have a clean interface applications can use to modify kernel behavior 14 | but the implementation of the interface is inefficient. 15 | 3. They have an unconstrained interface that is efficiently implemented but 16 | does not provide isolation between applications. 17 | 18 | SPIN provides applications a way to efficiently and safely modify the behavior 19 | of the kernel. Programs in SPIN are divided into the user-level code and a 20 | spindle: a portion of user code that is dynamically installed and run in the 21 | kernel. The kernel provides a set of abstractions for physical and logical 22 | resources, and the spindles are responsible for managing these resources. The 23 | spindles can also register to be invoked when certain kernel events (i.e. page 24 | faults) occur. Installing spindles directly into the kernel provides 25 | efficiency. Applications can execute code in the kernel without the need for a 26 | context switch. 27 | 28 | To ensure safety, spindles are written in a typed object-oriented language. 29 | Each spindle is like an object; it contains local state and a set of methods. 30 | Some of these methods can be called by the application, and some are registered 31 | as callbacks in the kernel. A spindle checker uses a combination of static 32 | analysis and runtime checks to ensure that the spindles meet certain kernel 33 | invariants. Moreover, SPIN relies on advanced compiler technology to ensure 34 | efficient spindle compilation. 
35 | 36 | General purpose high-performance computing, parallel processing, multimedia 37 | applications, databases, and information retrieval systems can benefit from the 38 | application-specific services provided by SPIN. Using techniques such as 39 | 40 | - extensible IPC; 41 | - application-level protocol processing; 42 | - fast, simple communication; 43 | - application-specific file systems and buffer cache management; 44 | - user-level scheduling; 45 | - optimistic transactions; 46 | - real-time scheduling policies; 47 | - application-specific virtual memory; and 48 | - runtime systems with memory system feedback, 49 | 50 | applications can be implemented more efficiently on SPIN than on traditional 51 | operating systems. 52 | 53 | -------------------------------------------------------------------------------- /papers/brewer2012cap.md: -------------------------------------------------------------------------------- 1 | # [CAP Twelve Years Later: How the "Rules" Have Changed (2012)](https://scholar.google.com/scholar?cluster=17642052422667212790) 2 | The CAP theorem dictates that in the face of network partitions, replicated 3 | data stores must choose between high availability and strong consistency. In 4 | this 12 year retrospective, Eric Brewer takes a look back at the CAP theorem 5 | and provides some insights. 6 | 7 | ## Why 2 of 3 is Misleading 8 | The CAP theorem is misleading for three reasons. 9 | 10 | 1. Partitions are rare, and when a system is not partitioned, the system can 11 | have both strong consistency and high availability. 12 | 2. Consistency and availability can vary by subsystem or even by operation. The 13 | granularity of consistency and availability does not have to be an entire 14 | system. 15 | 3. There are various degrees of consistency and various degrees of availability. 16 | 17 | ## CAP-Latency Connection 18 | After a node experiences a delay when communicating with another node, it has 19 | to make a choice between (a) aborting the operation and sacrificing availability 20 | or (b) continuing with the operation anyway and sacrificing consistency. 21 | Essentially, a partition is a time bound on communication. Viewing partitions 22 | like this leads to three insights: 23 | 24 | 1. There is no global notion of a partition. 25 | 2. Nodes can detect partitions and enter a special partition mode. 26 | 3. Users can vary the time after which they consider the system partitioned. 27 | 28 | ## Managing Partitions 29 | Systems should take three steps to handle partitions. 30 | 31 | 1. Detect the partition. 32 | 2. Enter a special partition mode in which nodes either (a) limit the 33 | operations which can proceed thereby decreasing availability or (b) continue 34 | with the operations decreasing consistency, making sure to log enough 35 | information to recover after the partition. 36 | 3. Recover from the partition once communication resumes. 37 | 38 | ## Which Operations Should Proceed 39 | The operations which a node permits during a partition depend on the 40 | invariants it is willing to sacrifice. For example, nodes may temporarily 41 | violate unique id constraints during a partition since they are easy to detect 42 | and resolve. Other invariants are too important to violate, so operations that 43 | could potentially violate them are stalled. 44 | 45 | ## Partition Recovery 46 | Once a system recovers from a partition it has to 47 | 48 | 1. make the state consistent again, and 49 | 2. compensate for any mistakes made during the partition.
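The following minimal sketch (Python; the class and method names are hypothetical, not from the paper) illustrates the bookkeeping a replica might do across these steps: while partitioned it stalls operations that could violate important invariants, logs the rest, and after the partition merges the peer's log and issues compensations for detected conflicts.

```python
class Replica:
    """A toy replica following the detect / partition-mode / recover steps."""
    def __init__(self):
        self.state = {}            # key -> value
        self.partitioned = False   # set to True when a partition is detected
        self.log = []              # operations applied while partitioned

    def write(self, key, value, risky=False):
        if self.partitioned and risky:
            # (a) limit risky operations, decreasing availability
            raise RuntimeError("stalled until the partition heals")
        if self.partitioned:
            # (b) proceed, but log enough information to recover later
            self.log.append((key, value))
        self.state[key] = value

    def recover(self, peer_log, compensate):
        # Merge the peer's partition-time writes; conflicting writes trigger a
        # compensating action (e.g. emailing a user) rather than silent loss.
        mine = dict(self.log)
        for key, value in peer_log:
            if key in mine and mine[key] != value:
                compensate(key, self.state[key], value)
            else:
                self.state[key] = value
        self.partitioned, self.log = False, []
```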
50 | 51 | Sometimes a system is unable to automatically make the state consistent and 52 | depends on manual intervention. Sometimes, the system can automatically restore 53 | the state because it carefully rejected some operations during the partition. 54 | Other systems can automatically restore consistency because they use clever 55 | data structures like CRDTs. 56 | 57 | Some systems, especially those which externalize actions (e.g. ATMs), must 58 | sometimes issue compensations (e.g. emailing users). 59 | -------------------------------------------------------------------------------- /papers/brin1998anatomy.md: -------------------------------------------------------------------------------- 1 | # [The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)](https://scholar.google.com/scholar?cluster=9820961755208603037) 2 | In this paper, Larry and Sergey present Google. 3 | 4 | ## Features 5 | One of Google's main features is its use of PageRank to rank search results. 6 | For an arbitrary site $A$ and a set of sites $T_1, \ldots, T_n$ that link to 7 | it, the PageRank function satisfies the following equation: 8 | 9 | $$ 10 | PageRank(A) = (1 - d) + d \sum_{i=1}^n \frac{PageRank(T_i)}{OutDegree(T_i)} 11 | $$ 12 | 13 | where $d$ is a damping factor. Intuitively, sites get a high PageRank if they 14 | are frequently linked to or linked to by other sites with high PageRank. 15 | 16 | Google also associates anchor text (the text part of a link) with the page 17 | being linked to instead of the page with the link. This anchor text provides 18 | information about a site which the site itself may not include. 19 | 20 | Google also provides a number of other miscellaneous features. For example, it 21 | scores words in documents based on their size and boldness. It also maintains a 22 | repository of HTML that researchers can use as a dataset. 23 | 24 | ## Related Work 25 | Most related work is in the field of information retrieval, but this literature 26 | typically assumes that information is being extracted from a small set of 27 | homogeneous documents. The web is huge and varied, and users can even 28 | maliciously try to affect search engine results. 29 | 30 | ## System Anatomy 31 | A URL Server sends URLs to a set of crawlers which download sites and send them 32 | to a store server which compresses them and stores them in a repository. An 33 | indexer uses the repository to build a forward index (stored in a set of 34 | barrels) which maps documents to a set of hits. Each hit records a word in the 35 | document, its position in the document, its boldness, etc. The indexer also 36 | stores all anchors in a file which a URL resolver uses to build a graph for 37 | PageRank. A sorter periodically converts forward indexes into inverted indexes. 38 | 39 | ## Data Structures 40 | - **BigFiles.** Big files are really big virtual files that map over multiple file systems. 41 | - **Repository.** The repository contains the HTML of crawled sites compressed 42 | with zlib. The repository contains a record for each site which contains its 43 | DocId, URL, length, and HTML contents. 44 | - **Document index.** The index is an ISAM index from DocId to pointers into 45 | the repository. It also contains pointers to the URL and title of the site. 46 | Additionally, there is a hash map which maps URLs to DocIds. 47 | - **Lexicon.** The lexicon is an in-memory list of roughly 14 million words. 48 | The words are stored contiguously with nulls separating them. 
49 | - **Hit lists.** Each hit needs to compactly represent the font size, 50 | capitalization, position, etc. of a word. Google uses some custom bit tricks 51 | to do this with very few bits. 52 | - **Forward index.** The forward index contains a list of DocIds followed by a 53 | WordId and a series of hits. The index is partitioned by WordId across a 54 | number of barrels. 55 | - **Inverted index.** The inverted index maps words to DocIds. 56 | 57 | ## Crawling the Web 58 | The URL server and the crawlers are implemented in Python. Each crawler has 59 | about 300 open connections, performs asynchronous I/O with an event loop, and maintains 60 | a local DNS cache to boost performance. Crawling is the most fragile part of 61 | the entire search engine and requires a lot of corner case handling. 62 | 63 | ## Indexing 64 | Google implements a custom parser to handle exotic HTML. The HTML is converted 65 | into a set of hits using the lexicon. New words not in the lexicon are logged 66 | and later added to the lexicon. 67 | 68 | ## Search 69 | A search query is converted into a set of WordIds. The index is searched to 70 | find a set of documents that contain the words. The results are ranked by a 71 | number of factors (e.g. PageRank, proximity of the query terms in multi-word searches) before being 72 | returned to the user. 73 | 74 | 77 | -------------------------------------------------------------------------------- /papers/bugnion1997disco.md: -------------------------------------------------------------------------------- 1 | ## [Disco: Running Commodity Operating Systems on Scalable Multiprocessors (1997)](https://scholar.google.com/scholar?cluster=17298410582406300869&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Operating systems are complex, million-line code bases. Multiprocessors were 4 | becoming popular, but it was too difficult to modify existing commercial 5 | operating systems to take full advantage of the new hardware. Disco is a 6 | *virtual machine monitor*, or *hypervisor*, that uses virtualization to run 7 | commodity operating systems in virtual machines on cache-coherent NUMA multiprocessors. Guest 8 | operating systems running on Disco are only slightly modified, yet are still 9 | able to take advantage of the multiprocessor. Moreover, Disco offers all the 10 | traditional benefits of a hypervisor (e.g. fault isolation). 11 | 12 | Disco provides the following interfaces: 13 | 14 | - *Processors.* Disco provides full virtualization of the CPU allowing for 15 | restricted direct execution. Some privileged registers are mapped to memory 16 | to allow guest operating systems to read them. 17 | - *Physical memory.* Disco virtualizes the guest operating system's physical 18 | address spaces, mapping them to hardware addresses. It also supports page 19 | migration and replication to alleviate the non-uniformity of a NUMA machine. 20 | - *I/O devices.* All I/O communication is simulated, and various virtual disks 21 | are used. 22 | 23 | Disco is implemented as follows: 24 | 25 | - *Virtual CPUs.* Disco maintains the equivalent of a process table entry for 26 | each guest operating system. Disco runs in kernel mode, guest operating 27 | systems run in supervisor mode, and applications run in user mode. 28 | - *Virtual physical memory.* To avoid the overhead of physical to hardware 29 | address translation, Disco maintains a large software physical to hardware 30 | TLB. 31 | - *NUMA.* Disco migrates pages to the CPUs that access them frequently, and 32 | replicates read-only pages to the CPUs that read them frequently.
This 33 | dynamic page migration and replication help mask the non-uniformity of a 34 | NUMA machine. 35 | - *Copy-on-write disks.* Disco can map physical addresses in different guest 36 | operating systems to a read-only page in hardware memory. This lowers the 37 | memory footprint of running multiple guest operating systems. The shared 38 | pages are copy-on-write. 39 | - *Virtual network interfaces.* Disco runs a virtual subnet over which guests 40 | can communicate using standard communication protocols like NFS and TCP. 41 | Disco uses a similar copy-on-write trick as above to avoid copying data 42 | between guests. 43 | -------------------------------------------------------------------------------- /papers/burns2016borg.md: -------------------------------------------------------------------------------- 1 | ## [Borg, Omega, and Kubernetes (2016)](#borg-omega-and-kubernetes-2016) 2 | **Summary.** 3 | Google has spent the last decade developing three container management systems. 4 | *Borg* is Google's main cluster management system that manages long running 5 | production services and non-production batch jobs on the same set of machines 6 | to maximize cluster utilization. *Omega* is a clean-slate rewrite of Borg using 7 | a more principled architecture. In Omega, all system state lives in a consistent 8 | Paxos-based storage system that is accessed by a multitude of components which 9 | act as peers. *Kubernetes* is the latest open source container manager that 10 | draws on lessons from both previous systems. 11 | 12 | All three systems use containers for security and performance isolation. 13 | Container technology has evolved greatly since the inception of Borg from 14 | chroot to jails to cgroups. Of course, containers cannot prevent all forms of 15 | performance interference. Today, containers also contain program images. 16 | 17 | Containers allow the cloud to shift from a machine-oriented design to an 18 | application-oriented design, a shift that brings a number of advantages. 19 | 20 | - The gap between where an application is developed and where it is deployed is 21 | shrunk. 22 | - Application writers don't have to worry about the details of the operating 23 | system on which the application will run. 24 | - Infrastructure operators can upgrade hardware without worrying about breaking 25 | a lot of applications. 26 | - Telemetry is tied to applications rather than machines which improves 27 | introspection and debugging. 28 | 29 | Container management systems typically also provide a host of other niceties 30 | including: 31 | 32 | - naming services, 33 | - autoscaling, 34 | - load balancing, 35 | - rollout tools, and 36 | - monitoring tools. 37 | 38 | In Borg, these features were integrated over time in ad-hoc ways. Kubernetes 39 | organizes these features under a unified and flexible API. 40 | 41 | Google's experience has taught it a number of things to avoid: 42 | 43 | - Container management systems shouldn't manage ports. Kubernetes gives each 44 | job a unique IP address allowing it to use any port it wants. 45 | - Containers should have labels, not just numbers. Borg gave each task an index 46 | within its job. Kubernetes allows jobs to be labeled with key-value pairs and 47 | be grouped based on these labels. 48 | - In Borg, every task belongs to a single job. Kubernetes makes task management 49 | more flexible by allowing a task to belong to multiple groups. 50 | - Omega exposed the raw state of the system to its components.
Kubernetes 51 | avoids this by arbitrating access to state through an API. 52 | 53 | Despite the decade of experience, there are still open problems yet to be 54 | solved: 55 | 56 | - Configuration. Configuration languages begin simple but slowly evolve into 57 | complicated and poorly designed Turing complete programming languages. It's 58 | ideal to have configuration files be simple data files and let real 59 | programming languages manipulate them. 60 | - Dependency management. Programs have lots of dependencies but don't manually 61 | state them. This makes automated dependency management very tough. 62 | -------------------------------------------------------------------------------- /papers/codd1970relational.md: -------------------------------------------------------------------------------- 1 | # [A Relational Model of Data for Large Shared Data Banks (1970)](https://scholar.google.com/scholar?cluster=1624408330930846885&hl=en&as_sdt=0,5) 2 | 3 | ## Summary 4 | In this paper, Ed Codd introduces the *relational data model*. Codd begins by 5 | motivating the importance of *data independence*: the independence of the way 6 | data is queried and the way data is stored. He argues that existing database 7 | systems at the time lacked data independence; namely, the ordering of 8 | relations, the indexes on relations, and the way the data was accessed was all 9 | made explicit when the data was queried. This made it impossible for the 10 | database to evolve the way data was stored without breaking existing programs 11 | which queried the data. The relational model, on the other hand, allowed for a 12 | much greater degree of data independence. After Codd introduces the relational 13 | model, he provides an algorithm to convert a relation (which may contain other 14 | relations) into *first normal form* (i.e. relations cannot contain other 15 | relations). He then describes basic relational operators, data redundancy, and 16 | methods to check for database consistency. 17 | 18 | ## Commentary 19 | 1. Codd's advocacy for data independence and a declarative query language have 20 | stood the test of time. I particularly enjoy one excerpt from the paper 21 | where Codd says, "The universality of the data sublanguage lies in its 22 | descriptive ability (not its computing ability)". 23 | 2. Database systems at the time generally had two types of data: collections 24 | and links between those collections. The relational model represented both 25 | as relations. Today, this seems rather mundane, but I can imagine this being 26 | counterintuitive at the time. This is also yet another example of a 27 | *unifying interface* which is demonstrated in both the Unix and System R 28 | papers. 29 | 30 | 31 | -------------------------------------------------------------------------------- /papers/conway2012logic.md: -------------------------------------------------------------------------------- 1 | ## [Logic and Lattices for Distributed Programming (2012)](TODO) ## 2 | **Summary.** 3 | CRDTs provide eventual consistency without the need for coordination. However, 4 | they suffer a *scope problem*: simple CRDTs are easy to reason about and use, 5 | but more complicated CRDTs force programmers to ensure they satisfy semilattice 6 | properties. They also lack composability. Consider, for example, a Students set 7 | and a Teams set. (Alice, Bob) can be added to Teams while concurrently Bob is 8 | removed from Students. 
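A minimal sketch of that scenario (Python; it uses 2P-sets, one standard set CRDT, and hypothetical element names): each set converges after merging, yet the cross-set invariant "every team member is a student" ends up violated on both replicas.

```python
class TwoPSet:
    """A 2P-set CRDT: adds and removes are sets, merge is elementwise union."""
    def __init__(self):
        self.added, self.removed = set(), set()
    def add(self, x): self.added.add(x)
    def remove(self, x): self.removed.add(x)
    def value(self): return self.added - self.removed
    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

def fresh_replica():
    students, teams = TwoPSet(), TwoPSet()
    students.add("alice"); students.add("bob")
    return students, teams

s1, t1 = fresh_replica()          # replica 1
s2, t2 = fresh_replica()          # replica 2

t1.add(("alice", "bob"))          # replica 1: form the team (Alice, Bob)
s2.remove("bob")                  # replica 2: concurrently remove Bob

for mine, theirs in ((s1, s2), (t1, t2)):   # anti-entropy in both directions
    mine.merge(theirs); theirs.merge(mine)

print(s1.value())                 # {'alice'}
print(t1.value())                 # {('alice', 'bob')} -- Bob is on a team
                                  # but is no longer a student
```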
Each individual set may be a CRDT, but there is no 9 | mechanism to enforce consistency between the CRDTs. 10 | 11 | Bloom and CALM, on the other hand, allow for mechanized program analysis to 12 | guarantee that a program can avoid coordination. However, Bloom suffers from a 13 | *type problem*: it only operates on sets, which precludes the use of other 14 | useful structures such as integers. 15 | 16 | This paper merges CRDTs and Bloom together by introducing *bounded join 17 | semilattices* into Bloom to form a new language: Bloom^L. Bloom^L operates over 18 | semilattices, applying semilattice methods. These methods can be designated as 19 | non-monotonic, monotonic, or homomorphic (which implies monotonic). So long as 20 | the program avoids non-monotonic methods, it can be realized without 21 | coordination. Moreover, morphisms can be implemented more efficiently than 22 | non-homomorphic monotonic methods. Bloom^L comes built in with a family of 23 | useful semilattices like booleans ordered by implication, integers ordered by 24 | less than and greater than, sets, and maps. Users can also define their own 25 | semilattices, and Bloom^L allows smooth interoperability between Bloom 26 | collections and Bloom^L lattices. Bloom^L's semi-naive implementation is 27 | comparable to Bloom's semi-naive implementation. 28 | 29 | The paper also presents two case studies. First, they implement a key-value 30 | store as a map from keys to values annotated with vector clocks: a design 31 | inspired by Dynamo. They also implement a purely monotonic shopping cart 32 | using custom lattices. 33 | -------------------------------------------------------------------------------- /papers/engler1995exokernel.md: -------------------------------------------------------------------------------- 1 | ## [Exokernel: An Operating System Architecture for Application-Level Resource Management (1995)](https://scholar.google.com/scholar?cluster=4636448334605780007&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Monolithic kernels provide a large number of abstractions (e.g. processes, 4 | files, virtual memory, interprocess communication) to applications. 5 | Microkernels push some of this functionality into user space but still provide 6 | a fixed set of abstractions and services. Providing these inextensible fixed 7 | abstractions is detrimental to applications. 8 | 9 | - An abstraction cannot be best for all applications. Tradeoffs must be made 10 | which can impact performance for some applications. 11 | - A rigid set of abstractions can make it difficult for an application to layer 12 | on its own set of abstractions. For example, a user level threads package may 13 | encounter some difficulties from not having access to page faults. 14 | - Having a rigid set of abstractions means the abstractions are rarely updated. 15 | Innovative OS features are seldom integrated into real world OSes. 16 | 17 | The exokernel operating system architecture takes a different approach. It 18 | provides protected access to hardware and nothing else. All abstractions are 19 | implemented by library operating systems. The exokernel purely provides 20 | protected access to the unabstracted underlying hardware. 21 | 22 | The exokernel interface governs how library operating systems get, use, and 23 | release resources. Exokernels follow these guidelines. 24 | 25 | - *Securely expose hardware.* All the details of the hardware (e.g. privileged 26 | instructions, DMA) should be exposed to libOSes.
27 | - *Expose allocation.* LibOSes should be able to request physical resources. 28 | - *Expose physical names.* The physical names of resources (e.g. physical page 29 | 5) should be exposed. 30 | - *Expose revocation.* LibOSes should be notified when resources are revoked. 31 | 32 | Exokernels use three main techniques to ensure protected access to the 33 | underlying hardware. 34 | 35 | 1. *Secure bindings.* A secure binding decouples authorization from use and is 36 | best explained through an example. A libOS can request that a certain entry 37 | be inserted into a TLB. The exokernel can check that the entry is valid. 38 | This is the authorization. Later, the CPU can use the TLB without any 39 | checking. This is use. The TLB entry can be used multiple times after being 40 | authorized only once. 41 | 42 | There are three forms of secure bindings. First are hardware mechanisms like 43 | TLB entries or screens in which each pixel is tagged with a process. 44 | Second are software mechanisms like TLB caching or packet filters. Third is 45 | downloading and executing code using type-safe languages, interpretation, or 46 | sandboxing. Exokernels can download Application-Specific Safe Handlers 47 | (ASHes). 48 | 2. *Visible resource revocation.* In traditional operating systems, resource 49 | revocation is made invisible to applications. For example, when an 50 | application's page is swapped to disk, it is not notified. The exokernel 51 | makes resource revocation visible by notifying the libOS. For example, each 52 | libOS is notified when its quantum is over. This allows it to do things like 53 | only store the registers it needs. 54 | 3. *Abort protocol.* If a libOS is misbehaving and not responding to revocation 55 | requests, the exokernel can forcibly remove allocations. Naively, it could 56 | kill the libOS. Less naively, it can simply remove all secure bindings. 57 | 58 | The paper also presents the Aegis exokernel and the ExOS library operating 59 | system. 60 | -------------------------------------------------------------------------------- /papers/golub1992microkernel.md: -------------------------------------------------------------------------------- 1 | # [Microkernel Operating System Architecture and Mach (1992)](https://scholar.google.com/scholar?cluster=1074648542567860981) 2 | A **microkernel** is a very minimal, very low-level piece of code that 3 | interfaces with hardware to implement the functionality needed for an operating 4 | system. Operating systems implemented using a microkernel architecture, rather 5 | than a monolithic kernel architecture, implement most of the operating system 6 | in user space on top of the microkernel.
This architecture affords many 7 | advantages including: 8 | 9 | - **tailorability**: many operating systems can be run on the same microkernel 10 | - **portability**: most hardware-specific code is in the microkernel 11 | - **network accessibility**: operating system services can be provided over the 12 | network 13 | - **extensibility**: new operating system environments can be tested along side 14 | existing ones 15 | - **real-time**: the kernel does not hold locks for very long 16 | - **multiprocessor support**: microkernel operations can be parallelized across 17 | processors 18 | - **multicomputer support**: microkernel operations can be parallelized across 19 | computers 20 | - **security**: a microkernel is a small trusted computing base 21 | 22 | This paper describes various ways in which operating systems can be implemented 23 | on top of the Mach microkernel. Mach's key features include: 24 | 25 | - **task and thread management**: Mach supports tasks (i.e. processes) and 26 | threads. Mach implements a thread scheduler, but privileged user level 27 | programs can alter the thread scheduling algorithms. 28 | - **interprocess communication**: Mach implements a capabilities based IPC 29 | mechanism known as ports. Every object in Mach (e.g. tasks, threads, memory) 30 | is managed by sending message to its corresponding port. 31 | - **memory object management**: Memory is represented as memory objects managed 32 | via ports. 33 | - **system call redirection**: Mach allows a number of system calls to be 34 | caught and handled by user level code. 35 | - **device support**: Devices are represented as ports. 36 | - **user multiprocessing**: Tasks can use a user-level threading library. 37 | - **multicomputer support**: Mach abstractions can be transparently implemented 38 | on distributed hardware. 39 | - **Continuations**: In a typical operating system, when a thread blocks, all 40 | of its registers are stored somewhere before another piece of code starts to 41 | run. Its stack is also left intact. When the blocking thread is resumed, its 42 | stored registers are put back in place and the thread starts running again. 43 | This can be wasteful if the thread doesn't need all of the registers or its 44 | stack. In Mach, threads can block with a continuation: an address and a bunch 45 | of state. This can be more efficient since the thread only saves what it 46 | needs to keep executing. 47 | 48 | Many different operating systems can be built on top of Mach. It's ideal that 49 | applications built for these operating systems can continue to run unmodified 50 | even when the underlying OS is implemented on top of Mach. A key part of this 51 | virtualization is something called an **emulation library**. An emulation 52 | library is a piece of code inserted into an application's address space. When a 53 | program issues system call, Mach immediately redirects control flow to the 54 | emulation library to process it. The emulation library can then handle the 55 | system call by, for example, issuing an RPC to an operating system server. 56 | 57 | Operating systems built on Mach can be architected in one of three ways: 58 | 59 | 1. The entire operating system can be baked into the emulation library. 60 | 2. The operating system can be shoved into a single multithreaded Mach task. 61 | This architecture can be very memory efficient, and is easy to implement 62 | since the guest operating system can be ported with very little code 63 | modification. 64 | 3. 
The operating system can be decomposed into a larger number of smaller 65 | processes that communicate with one another using IPC. This architecture 66 | encourages modularity, and allows certain operating system components to be 67 | reused between operating systems. This approach can lead to inefficiency, 68 | especially if IPC is not lightning fast! 69 | -------------------------------------------------------------------------------- /papers/graefe1990encapsulation.md: -------------------------------------------------------------------------------- 1 | # [Encapsulation of Parallelism in the Volcano Query Processing System (1990)](https://scholar.google.com/scholar?cluster=12203514891599153151) 2 | The Volcano query processing system uses the operator model of query execution 3 | that we all know and love. In this paper, Graefe discusses the **exchange 4 | operator**: an operator that you can insert into a query plan which 5 | automatically parallelizes the execution of the query plan. The exchange 6 | operator allows queries to be executed in parallel without having to rewrite 7 | existing operators. 8 | 9 | ## The Template Approach 10 | Parallel databases like Gamma and Bubba used a **template approach** to query 11 | parallelization. With the template approach, algorithms are plugged into 12 | templates which abstract away the details of where tuples come from and where 13 | they are sent to. They also abstract away the mechanisms used to send a tuple 14 | over the network or to another process using some IPC mechanism. 15 | Unfortunately, the template approach *required* that tuples either be sent over 16 | the network or over IPC which greatly reduced its performance. 17 | 18 | ## Volcano Design 19 | Volcano implements queries as a tree of iterators with an open-next-close 20 | interface. The paper discusses something called state records which allow the 21 | same operator to be used multiple times with different parameters, though this seems 22 | obsolete now that we have modern programming languages. Iterators return 23 | structures which contain record ids and pointers to the records which are 24 | pinned in the buffer pool. 25 | 26 | ## The Operator Model of Parallelization 27 | In the operator model of parallelization, exchange operators are placed in 28 | a query plan to automatically parallelize its execution. Exchange operators 29 | provide three forms of parallelism. 30 | 31 | 1. **Pipeline parallelism.** The exchange operator allows its child and parent 32 | to run in different processes. When opened, the exchange operator forks a 33 | child and establishes a region of shared memory called a **port** which the 34 | parent and child use to communicate with one another. The child part of the 35 | exchange operator (known as the **producer**) continuously calls next on its 36 | child, buffers the records into batches called packets, and writes them into 37 | the port (with appropriate synchronization). The parent part of the exchange 38 | operator (known as the **consumer**) reads a record from a packet 39 | in the port whenever its next method is invoked. For flow control, the 40 | producer decrements a semaphore and the consumer increments a 41 | semaphore. 42 | 2. **Bushy parallelism.** Bushy parallelism is when multiple siblings in a query 43 | plan are executed in parallel. Bushy parallelism is really just pipeline 44 | parallelism applied to multiple siblings. 45 | 3.
**Intra-operator parallelism.** Intra-operator parallelism allows the same 46 | operator to run in parallel on multiple partitions of input data. As before, 47 | the exchange operator forks a (master) producer and establishes a port. This 48 | master producer forks other producers and gives them the location of the 49 | port. Each forked producer might eventually invoke another group of exchange 50 | operators. The master consumer will fork a master producer and create a 51 | port. Producers can send records to consumers using round-robin, hash, or 52 | range partitioning. When a node is out of records, it sends an 53 | end-of-stream message to all consumers. Consumers end their stream when they 54 | count the appropriate number of end-of-stream messages. Unlike the pipelined 55 | parallelism version of the exchange operator, it is 56 | not so obvious how to implement the intra-operator version. Moreover, I think it involves modifications 57 | to the leaves of a query plan. 58 | 59 | ## Variants of Exchange 60 | Producers can also broadcast tuples to consumers. A merge operator sits on top 61 | of an exchange operator and merges together the sorted streams produced by each 62 | producer. The paper also discusses sorting files and having exchange operators 63 | in the middle of a query fragment, but it's hard to understand. 64 | -------------------------------------------------------------------------------- /papers/halevy2016goods.md: -------------------------------------------------------------------------------- 1 | ## [Goods: Organizing Google's Datasets (2016)](TODO) ## 2 | **Summary.** 3 | For fear of fettering development and innovation, companies often allow 4 | engineers free rein to generate and analyze datasets at will. This often leads 5 | to unorganized data lakes: a ragtag collection of datasets from a diverse set 6 | of sources. Google Dataset Search (Goods) is a system which uses unobtrusive 7 | post-hoc metadata extraction and inference to organize Google's unorganized 8 | datasets and present curated dataset information, such as metadata and 9 | provenance, to engineers. 10 | 11 | Building a system like Goods at Google scale presents many challenges. 12 | 13 | - *Scale.* There are 26 billion datasets. *26 billion* (with a b)! 14 | - *Variety.* Data comes from a diverse set of sources (e.g. BigTable, Spanner, 15 | logs). 16 | - *Churn.* Roughly 5% of the datasets are deleted every day, and datasets are 17 | created roughly as quickly as they are deleted. 18 | - *Uncertainty.* Some metadata inference is approximate and speculative. 19 | - *Ranking.* To facilitate useful dataset search, datasets have to be ranked by 20 | importance: a difficult heuristic-driven process. 21 | - *Semantics.* Extracting the semantic content of a dataset is useful but 22 | challenging. For example, consider a file of protos that doesn't reference the 23 | type of proto being stored. 24 | 25 | The Goods catalog is a BigTable keyed by dataset name where each row contains 26 | metadata including 27 | 28 | - *basic metadata* like timestamp, owners, and access permissions; 29 | - *provenance* showing the lineage of each dataset; 30 | - *schema*; 31 | - *data summaries* extracted from source code; and 32 | - *user provided annotations*. 33 | 34 | Moreover, similar datasets or multiple versions of the same logical dataset are 35 | grouped together to form *clusters*.
Metadata for one element of a cluster can 36 | be used as metadata for other elements of the cluster, greatly reducing the 37 | amount of metadata that needs to be computed. Data is clustered by timestamp, 38 | data center, machine, version, and UID, all of which is extracted from dataset 39 | paths (e.g. `/foo/bar/montana/August01/foo.txt`). 40 | 41 | In addition to storing dataset metadata, each row also stores *status 42 | metadata*: information about the completion status of various jobs which 43 | operate on the catalog. The numerous concurrently executing batch jobs use 44 | *status metadata* as a weak form of synchronization and dependency resolution, 45 | potentially deferring the processing of a row until another job has processed 46 | it. 47 | 48 | The fault tolerance of these jobs is provided by a mix of job retries, 49 | BigTable's idempotent update semantics, and a watchdog that terminates 50 | divergent programs. 51 | 52 | Finally, a two-phase garbage collector tombstones rows that satisfy a garbage 53 | collection predicate and removes them one day later if they still match the 54 | predicate. Batch jobs do not process tombstoned rows. 55 | 56 | The Goods frontend includes dataset profile pages, dataset search driven by a 57 | handful of heuristics to rank datasets by importance, and teams dashboard. 58 | -------------------------------------------------------------------------------- /papers/hellerstein2010declarative.md: -------------------------------------------------------------------------------- 1 | ## [The Declarative Imperative: Experiences and Conjectures in Distributed Logic (2010)](https://scholar.google.com/scholar?cluster=1374149560926608837&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | With (a strict interpretation of) Moore's Law in decline and an overabundance 4 | of compute resources in the cloud, performance necessitates parallelism. The 5 | rub? Parallel programming is difficult. Contemporaneously, Datalog (and 6 | variants) have proven effective in an increasing number of domains from 7 | networking to distributed systems. Better yet, declarative logic programming 8 | allows for programs to be built with orders of magnitude less code and permits 9 | formal reasoning. In this invited paper, Joe discusses his experiences with 10 | distributed declarative programming and conjectures some deep connections 11 | between logic and distributed systems. 12 | 13 | Joe's group has explored many distributed Datalog variants, the latest of which 14 | is *Dedalus*. Dedalus includes notions of time: both intra-node atomicity and 15 | sequencing and inter-node causality. Every table in Dedalus includes a 16 | timestamp in its rightmost column. Dedalus' rules are characterized by how they 17 | interact with these timestamps: 18 | 19 | - *Deductive rules* involve a single timestamp. They are traditional Datalog 20 | stamements. 21 | - *Inductive rules* involve a head with a timestamp one larger than the 22 | timestamps of the body. They represent the creation of facts from one point 23 | in time to the next point in time. 24 | - *Asynchronous rules* involve a head with a non-deterministically chosen 25 | timestamp. These rules capture the notion of non-deterministic message 26 | delivery. 27 | 28 | Joe's group's experience with distributed logic programming lead to the 29 | following conclusions: 30 | 31 | - Datalog can very concisely express distributed algorithms that involved 32 | recursive computations of transitive closures like web crawling. 
33 | - Annotating relations with a location specifier column allows tables to be 34 | transparently partitions and allows for declarative communication: a form of 35 | "network data independence". This could permit many networking optimizations. 36 | - Stratifying programs based on timesteps introduces a notion of 37 | transactionality. Every operation taking place in the same timestamp occurs 38 | atomically. 39 | - Making all tables ephemeral and persisting data via explicit inductive rules 40 | naturally allows transience in things like soft-state caches without 41 | precluding persistence. 42 | - Treating events as a streaming join of inputs with persisted data is an 43 | alternative to threaded or event-looped parallel programming. 44 | - Monotonic programs parallelize embarrassingly well. Non-monotonicity requires 45 | coordination and coordination requires non-monotonicity. 46 | - Logic programming has its disadvantages. There is code redundancy; lack of 47 | scope, encapsulation, and modularity; and providing consistent distributed 48 | relations is difficult. 49 | 50 | The experience also leads to a number of conjectures: 51 | 52 | - The CALM conjecture stats that programs that can be expressed in monotonic 53 | Datalog are exactly the programs that can be implemented with 54 | coordination-free eventual consistency. 55 | - Dedalus' asynchronous rules allow for an infinite number of traces. Perhaps, 56 | all these traces are confluent and result in the same final state. Or perhaps 57 | they are all equivalent for some notion of equivalence. 58 | - The CRON conjecture states that messages sent to the past lead only to 59 | paradoxes if the message has non-monotonic implications. 60 | - If computation today is so cheap, then the real computation cost comes from 61 | coordination between strata. Thus, the minimum number of Dedalus timestamps 62 | required to implement a program represents its minimum *coordination 63 | complexity*. 64 | - To further decrease latency, programs can be computed approximately either by 65 | providing probabilistic bounds on their outputs or speculatively executing 66 | and fixing results in a post-hoc manner. 67 | -------------------------------------------------------------------------------- /papers/hellerstein2012madlib.md: -------------------------------------------------------------------------------- 1 | # [The MADlib Analytics Library or MAD Skills, the SQL (2012)](https://scholar.google.com/scholar?cluster=2154261383124050736) 2 | MADlib is a library of machine learning and statistics functions that 3 | integrates into a relational database. For example, you can store labelled 4 | training data in a relational database and run logistic regression over it like 5 | this: 6 | 7 | ```sql 8 | SELECT madlib.logregr_train( 9 |   'patients',                           -- source table 10 |   'patients_logregr',                   -- output table 11 |   'second_attack',                      -- labels 12 |   'ARRAY[1, treatment, trait_anxiety]', -- features 13 |   NULL,                                 -- grouping columns 14 |   20,                                   -- max number of iterations 15 |   'irls'                                -- optimizer 16 | ); 17 | ``` 18 | 19 | MADlib programming is divided into two conceptual types of programming: 20 | macro-programming and micro-programming. **Macro-programming** deals with 21 | partitioning matrices across nodes, moving matrix partitions, and operating on 22 | matrices in parallel. 
**Micro-programming** deals with writing efficient code 23 | which operates on a single chunk of a matrix on one node. 24 | 25 | ## Macro-Programming 26 | MADlib leverages user-defined aggregates to operate on matrices in parallel. A 27 | user defined-aggregate over a set of type `T` comes in three pieces. 28 | 29 | - A **transition function** of type `A -> T -> A` folds over the set. 30 | - A **merge function** of type `A -> A -> A` merges intermediate aggregates. 31 | - A **final function** `A -> B` translates the final aggregate. 32 | 33 | Standard user-defined aggregates aren't sufficient to express a lot of machine 34 | learning algorithms. They suffer two main problems: 35 | 36 | 1. User-defined aggregates cannot easily iterate over the same data multiple 37 | times. Some solutions involve counted iteration by joining with virtual 38 | tables, window aggregates, and recursion. MADlib elects to express iteration 39 | using small bits of Python driver code that stores state between iterations 40 | in temporary tables. 41 | 2. User-defined aggregates are not polymorphic. Each aggregate must explicitly 42 | declare the type of its input, but some aggregates are pretty generic. 43 | MADlib uses Python UDFs which generate SQL based on input type. 44 | 45 | ## Micro-Programming 46 | MADlib user-defined code calls into fast linear algebra libraries (e.g. Eigen) 47 | for dense linear algebra. MADlib implements its own sparse linear algebra 48 | library in C. MADlib also provides a C++ abstraction for writing low-level 49 | linear algebra code. Notably, it translates C++ types into database types and 50 | integrates nicely with libraries like Eigen. 51 | 52 | ## Examples 53 | Least squares regression can be computed in a single pass of the data. Logistic 54 | regression and k-means clustering require a Python driver to manage multiple 55 | iterations. 56 | 57 | ## University Research 58 | Wisconsin implemented stochastic gradient descent in MADlib. Berkeley and 59 | Florida implemented some statistic text analytics features including text 60 | feature expansion, approximate string matching, Viterbi inference, and MCMC 61 | inference (though I don't know what any of these are). 62 | 63 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /papers/hindman2011mesos.md: -------------------------------------------------------------------------------- 1 | ## [Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (2011)](https://scholar.google.com/scholar?cluster=816726489244916508&hl=en&as_sdt=0,5) 2 | See [`https://github.com/mwhittaker/mesos_talk`](https://github.com/mwhittaker/mesos_talk). 3 | 4 | -------------------------------------------------------------------------------- /papers/kohler2000click.md: -------------------------------------------------------------------------------- 1 | ## [The Click Modular Router (2000)](https://goo.gl/t1AlsN) ## 2 | **Summary.** 3 | Routers are more than routers. They are also firewalls, load balancers, address 4 | translators, etc. Unfortunately, implementing these additional router 5 | responsibilities is onerous; most routers are closed platforms with inflexible 6 | designs. The Click router architecture, on the other hand, permits the creation 7 | of highly modular and flexible routers with reasonable performance. Click is 8 | implemented in C++ and compiles router specifications to routers which run on 9 | general purpose machines. 
10 | 11 | Much like how Unix embraces modularity by composing simple *programs* using 12 | *pipes*, the Click architecture organizes a network of *elements* connected by 13 | *connections*. Each element represents an atomic unit of computation (e.g. 14 | counting packets) and is implemented as a C++ object that points to the other 15 | elements to which it is connected. Each element has input and output ports, can 16 | be provided arguments as configuration strings, and can expose arbitrary 17 | methods to other elements and to users. 18 | 19 | Connections and ports are either *push-based* or *pull-based*. The source 20 | element of a push connection pushes data to the destination element. For 21 | example, when a network device receives a packet, it pushes it to other 22 | elements. Dually, the destination element of a pull connection can request to 23 | pull data from the source element or receive null if no such data is available. 24 | For example, when a network device is ready to be written to, it may pull data 25 | from other elements. Ports can be designated as pull, push, or agnostic. Pull 26 | ports must be matched with other pull ports, push ports must be matched with 27 | other push ports, and agnostic ports are labeled as pull or push ports during 28 | router initialization. *Queues* are packet storing elements with a push input 29 | port and pull output port; they form the interface between push and pull 30 | connections. 31 | 32 | Some elements can process packets with purely local information. Other elements 33 | require knowledge of other elements. For example, a packet dropping element 34 | placed before a queue might integrate the length of the queue in its packet 35 | dropping policy. As a compromise between purely local information and complete 36 | global information, Click provides *flow-based router context* to elements 37 | allowing them to answer queries such as "what queue would a packet reach if I 38 | sent it out of my second port?". 39 | 40 | Click routers are specified using a simple declarative configuration language. 41 | The language allows users to group elements into *compound elements* that 42 | behave as a single element. 43 | 44 | The Click kernel driver is a single kernel thread that processes elements on a 45 | task queue. When an element is processed, it may in turn push data to or pull 46 | data from other elements forcing them to be processed. To avoid interrupt and 47 | device management overheads, the driver uses polling. Every input and output 48 | device polls for data whenever it is run by the driver. Router configurations 49 | are loaded and element methods are called via the `/proc` file system. The 50 | driver also supports hot-swapping in new router configurations which take over 51 | the state of the previous router. 52 | 53 | The authors implement a fully compliant IP router using Click and explore 54 | various extensions to it including scheduling and dropping packets. The 55 | performance of the IP router is measured and analyzed. 56 | -------------------------------------------------------------------------------- /papers/kornacker2015impala.md: -------------------------------------------------------------------------------- 1 | ## [Impala: A Modern, Open-Source SQL Engine for Hadoop (2015)](https://scholar.google.com/scholar?cluster=14277865292469814912&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Impala is a distributed query engine built on top of Hadoop. 
That is, it builds 4 | off of existing Hadoop tools and frameworks and reads data stored in Hadoop 5 | file formats from HDFS. 6 | 7 | Impala's `CREATE TABLE` commands specify the location and file format of data 8 | stored in Hadoop. This data can also be partitioned into different HDFS 9 | directories based on certain column values. Users can then issue typical SQL 10 | queries against the data. Impala supports batch INSERTs but doesn't support 11 | UPDATE or DELETE. Data can also be manipulated directly by going through HDFS. 12 | 13 | Impala is divided into three components. 14 | 15 | 1. An Impala daemon (impalad) runs on each machine and is responsible for 16 | receiving queries from users and for orchestrating the execution of queries. 17 | 2. A single Statestore daemon (statestored) is a pub/sub system used to 18 | disseminate system metadata asynchronously to clients. The statestore has 19 | weak semantics and doesn't persist anything to disk. 20 | 3. A single Catalog daemon (catalogd) publishes catalog information through the 21 | statestored. The catalogd pulls in metadata from external systems, puts it 22 | in Impala form, and pushes it through the statestored. 23 | 24 | Impala has a Java frontend that performs the typical database frontend 25 | operations (e.g. parsing, semantic analysis, and query optimization). It uses a 26 | two-phase query planner. 27 | 28 | 1. *Single node planning.* First, a single-node non-executable query plan tree 29 | is formed. Typical optimizations like join reordering are performed. 30 | 2. *Plan parallelization.* After a single node plan is formed, it is fragmented 31 | and divided between multiple nodes with the goal of minimizing data movement 32 | and maximizing scan locality. 33 | 34 | Impala has a C++ backend that uses Volcano-style iterators with exchange 35 | operators and runtime code generation using LLVM. To efficiently read data from 36 | disk, Impala bypasses the traditional HDFS protocols. The backend supports a 37 | lot of different file formats including Avro, RC, sequence, plain text, and 38 | Parquet. 39 | 40 | For cluster and resource management, Impala uses a home-grown system called Llama that 41 | sits on top of YARN. 42 | 43 | -------------------------------------------------------------------------------- /papers/lagar2009snowflock.md: -------------------------------------------------------------------------------- 1 | ## [SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing (2009)](https://scholar.google.com/scholar?cluster=3030124086251534312&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Public clouds like Amazon's EC2 or Google's Compute Engine allow users to 4 | elastically spawn a huge number of virtual machines on a huge number of physical 5 | machines. However, spawning a VM can take on the order of minutes, and 6 | typically spawned VMs are launched in some static initial state. SnowFlock 7 | implements the VM fork abstraction, in which a parent VM forks a set of 8 | children VMs all of which inherit a snapshot of the parent. Moreover, SnowFlock 9 | implements this abstraction with subsecond latency. A subsecond VM fork 10 | implementation can be used for sandboxing, parallel computation (the focus of 11 | this paper), load handling, etc. 12 | 13 | SnowFlock is built on top of Xen. Specifically, it is a combination of 14 | modifications to the Xen hypervisor and a set of daemons running in dom0 which 15 | together form a distributed system that manages virtual machine forking. 16 | Guests use a set of calls (i.e.
`sf_request_ticket`, `sf_clone`, `sf_join`, 17 | `sf_kill`, and `sf_exit`) to request resources on other machines, fork 18 | children, wait for children, kill children, and exit from a child. This implies 19 | that applications must be modified. SnowFlock implements the forking mechanism 20 | and leaves policy to pluggable cluster framework management software. 21 | 22 | SnowFlock takes advantage of four insights with four distinct implementation 23 | details. 24 | 25 | 1. VMs can get very large, on the order of a couple of GBs. Copying these 26 | images between physical machines can saturate the network, and even when 27 | implemented using things like multicast, can still be slow. Thus, SnowFlock 28 | must reduce the amount of state transfered between machines. SnowFlock takes 29 | advantage of the fact that a newly forked VM doesn't need the entire VM 30 | image to start running. Instead, SnowFlock uses *VM Descriptors*: a 31 | condensed VM image that consists of VM metadata, a few memory pages, 32 | registers, a global descriptor table, and page tables. When a VM is forked, 33 | a VM descriptor for it is formed and sent to the children to begin running. 34 | 2. When a VM is created from a VM descriptor, it doesn't have all the memory it 35 | needs to continue executing. Thus, memory must be sent from the parent when 36 | it is first accessed. Parent VMs use copy-on-write to maintain an immutable 37 | copy of memory at the point of the snapshot. Children use Xen shadow pages 38 | to trap accesses to pages not yet present and request them from the parent 39 | VM. 40 | 3. Sometimes VMs access memory that they don't really need to get from the 41 | parent. SnowFlock uses two *avoidance heuristics* to avoid the transfer 42 | overhead. First, if new memory is being allocated (often indirectly through 43 | a call to something like malloc), the memory contents are not important and 44 | do not need to be paged in from the parent. A similar optimization is made 45 | for buffers being written to by devices. 46 | 4. Finally, parent and children VMs often access the same code and data. 47 | SnowFlock takes advantage of this data locality by prefetching; when one 48 | child VM requests a page of memory, the parent multicasts it to all 49 | children. 50 | 51 | Furthermore, the same copy-on-write techniques to maintain an immutable 52 | snapshot of memory are used on the disk. And, the parent and children virtual 53 | machines are connected by a virtual subnet in which each child is given an IP 54 | address based on its unique id. 55 | -------------------------------------------------------------------------------- /papers/lehman1999t.md: -------------------------------------------------------------------------------- 1 | ## [T Spaces: The Next Wave (1999)](https://goo.gl/mxIv4g) ## 2 | **Summary.** 3 | T Spaces is a 4 | 5 | > tuplespace-based network communication buffer with database capabilities that 6 | > enables communication between applications and devices in a network of 7 | > heterogeneous computers and operating systems 8 | 9 | Essentially, it's Linda++; it implements a Linda tuplespace with a couple new 10 | operators and transactions. 11 | 12 | The paper begins with a history of related tuplespace based work. The notion of 13 | a shared collaborative space originated from AI *blackboard systems* popular in 14 | the 1970's, the most famous of which was the Hearsay-II system. 
Later, the 15 | Stony Brook microcomputer Network (SBN), a cluster organized in a torus 16 | topology, was developed at Stony Brook, and Linda was invented to program it. 17 | Over time, the domain in which tuplespaces were popular shifted from parallel 18 | programming to distributed programming, and a huge number of Linda-like systems 19 | were implemented. 20 | 21 | T Spaces is the marriage of tuplespaces, databases, and Java. 22 | 23 | - *Tuplespaces* provide a flexible communication model; 24 | - *databases* provide stability, durability, and advanced querying; and 25 | - *Java* provides portability and flexibility. 26 | 27 | T Spaces implements a Linda tuplespace with a few improvements: 28 | 29 | - In addition to the traditional `Read`, `Write`, `Take`, `WaitToRead`, and 30 | `WaitToTake` operators, T Spaces also introduces a `Scan`/`ConsumingScan` 31 | operator to read/take all tuples matched by a query and a `Rhonda` operator 32 | to exchange tuples between processes. 33 | - Users can also dynamically register new operators, the implementation of 34 | which takes advantage of Java. 35 | - Fields of tuples are indexed by name, and tuples can be queried by named 36 | value. For example, the query `(foo = 8)` returns *all* tuples (of any type) 37 | with a field `foo` equal to 8. These indexes are similar to the inversions 38 | implemented in Phase 0 of System R. 39 | - Range queries are supported. 40 | - To avoid storing large values inside of tuples, file URLs can instead be 41 | stored, and T Spaces transparently handles locating and transferring the file 42 | contents. 43 | - T Spaces implements a group-based ACL form of authorization. 44 | - T Spaces supports transactions. 45 | 46 | To evaluate the expressiveness and performance of T Spaces, the authors 47 | implement a collaborative web-crawling application, a web-search information 48 | delivery system, and a universal information appliance. 49 | -------------------------------------------------------------------------------- /papers/letia2009crdts.md: -------------------------------------------------------------------------------- 1 | ## [CRDTs: Consistency without concurrency control (2009)](https://scholar.google.com/scholar?cluster=9773072957814807258&hl=en&as_sdt=0,5) 2 | **Overview.** 3 | Concurrently updating distributed mutable data is a challenging problem that 4 | often requires expensive coordination. *Commutative replicated data types* 5 | (CRDTs) are data types with commutative update operators that can provide 6 | convergence without coordination. Moreover, non-trivial CRDTs exist; this paper 7 | presents Treedoc: an ordered set CRDT. 8 | 9 | **Ordered Set CRDT.** 10 | An ordered set CRDT represents an ordered sequence of atoms. Atoms are 11 | associated with IDs with five properties: 12 | 1. Two replicas of the same atom have the same ID. 13 | 2. No two atoms have the same ID. 14 | 3. IDs are constant for the lifetime of an ordered set. 15 | 4. IDs are totally ordered. 16 | 5. The space of IDs is dense. That is, for all IDs P and F where P < F, there 17 | exists an ID N such that P < N < F. 18 | 19 | The ordered set supports two operations: 20 | 21 | 1. insert(ID, atom) 22 | 2. delete(ID) 23 | 24 | where atoms are ordered by their corresponding ID. Concretely, Treedoc is 25 | represented as a tree, and IDs are paths in the tree ordered by an infix 26 | traversal.
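A minimal sketch of such IDs (Python; it ignores the mini-nodes and disambiguators described next, and the `between` rule below is just one simple way to satisfy density): an ID is a path of 0/1 branch choices, infix order falls out of padding each path with a sentinel that sorts between 0 and 1, and a fresh ID can always be minted between two existing ones.

```python
def infix_key(path):
    # A node sorts after its left subtree (0 bits) and before its right
    # subtree (1 bits), so pad the path with a sentinel halfway in between.
    return list(path) + [0.5]

def between(p, q):
    """Mint an ID strictly between p and q (requires p < q in infix order)."""
    assert infix_key(p) < infix_key(q)
    if list(q[:len(p)]) == list(p):   # q is below p: step left underneath q
        return tuple(q) + (0,)
    return tuple(p) + (1,)            # otherwise: step right underneath p

root, left = (), (0,)                 # the root and its left child
mid = between(left, root)             # (0, 1): the left child's right child
assert infix_key(left) < infix_key(mid) < infix_key(root)
```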
Nodes in the tree can be *major nodes* and contain multiple 27 | *mini-nodes* where each mini-node is annotated with a totally ordered 28 | *disambiguator* unique to each node. 29 | 30 | Deletes simply tombstone a node. Inserts work like a typical binary tree 31 | insertion. To avoid the tree and IDs from getting too large, the tree can 32 | periodically be *flattened*: the tree is restructured into an array of nodes. 33 | This operation does not commute nicely with others, so a coordination protocol 34 | like 2PC is required. 35 | 36 | **Treedoc in the Large Scale.** 37 | Requiring a coordination protocol for flattening doesn't scale and runs against 38 | the spirit of CRDTs. It also doesn't handle churn well. Instead, nodes can be 39 | partitioned into a set of core nodes and a set of nodes in the nebula. The core 40 | nodes coordinate while the nebula nodes lag behind. There are protocols for 41 | nebula nodes to catch up to the core nodes. 42 | -------------------------------------------------------------------------------- /papers/li2014automating.md: -------------------------------------------------------------------------------- 1 | ## [Automating the Choice of Consistency Levels in Replicated Systems (2014)](https://scholar.google.com/scholar?cluster=5894108460168532172&hl=en&as_sdt=0,5) 2 | Geo-replicated data stores that replicate data have to choose between incurring 3 | the performance overheads of implementing strong consistency or the brain 4 | boggling semantics of weak consistency. Some data stores allow users to make 5 | this decision on a fine grained level, allowing some operations to operate with 6 | strong consistency while other operations operate under weak consistency. For 7 | example, [*Making Geo-Replicated Systems Fast as Possible, Consistent when 8 | Necessary*](https://scholar.google.com/scholar?cluster=4316742817395056095&hl=en&as_sdt=0,5) 9 | introduced RedBlue consistency in which users annotate operations as red 10 | (strong) or blue (weak). Typically, the process of choosing the consistency 11 | level for each operation has been a manual task. This paper builds off of 12 | RedBlue consistency and automates the process of labeling operations as red or 13 | blue. 14 | 15 | **Overview.** 16 | When using RedBlue consistency, users can decompose operations into generator 17 | operations and shadow operations to improve commutativity. This paper automates 18 | this process by converting Java applications that write state to a relational 19 | database to issue operations against CRDTs. Moreover, it uses a combination of 20 | static analysis and runtime checks to ensure that operations are invariant 21 | confluent. 22 | 23 | **Generating Shadow Operations.** 24 | Relations are just sets of tuples, so we can model them as CRDT sets. Moreover, 25 | we can model individual fields as CRDTs. This paper presents field CRDTs 26 | (PN-Counters, LWW-registers) and set CRDTs to model relations. Users then 27 | annotate SQL create statements indicating which CRDT to use. 28 | 29 | Using a custom JDBC driver, user operations can be converted into a 30 | corresponding shadow operation that issues CRDT operations. 31 | 32 | **Classification of Shadow Operations.** 33 | We want to find the database states and transaction arguments that guarantee 34 | invariant-confluence. This can be a very difficult (probably undecidable) 35 | problem in general. Instead, we simplify the problem by considering the 36 | possible trace of CRDT operations through a program. 
Even with this simplifying 37 | assumption, tracing execution through loops can be challenging. We can again 38 | simplify things by only considering loop where each iteration is independent of 39 | one another. 40 | 41 | We model each program as a regular expression over statements. We then unroll 42 | the regular expression to get the set of all execution traces. Using these 43 | traces we can construct a map from traces, or templates, to weakest 44 | preconditions that ensure invariant confluence. 45 | 46 | At runtime, we need only look up the weakest precondition given an execution 47 | trace and check that it is true. 48 | 49 | **Implementation.** 50 | The system is implemented as 15K lines of Java and 533 lines of OCaml. It 51 | builds off of MySQL, an existing Java parser, and Gemini. 52 | 53 | -------------------------------------------------------------------------------- /papers/liu1973scheduling.md: -------------------------------------------------------------------------------- 1 | # [Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)](https://scholar.google.com/scholar?cluster=11972780054098474552&hl=en&as_sdt=0,5) 2 | Consider a hard-real-time environment in which tasks *must* finish within some 3 | time after they are requested. We make the following assumptions. 4 | 5 | - (A1) Tasks are periodic with fixed periods. 6 | - (A2) Tasks must finish before they are next requested. 7 | - (A3) Tasks are independent. 8 | - (A4) Tasks have constant runtime. 9 | - (A5) Non-periodic tasks are not realtime. 10 | 11 | Thus, we can model each task $t_i$ as a period $T_i$ and runtime $C_i$. A 12 | scheduling algorithm that immediately preempts tasks to guarantee that the task 13 | with the highest priority is running is called a **preemptive priority 14 | scheduling algorithm**. We consider three preemptive priority scheduling 15 | algorithms: a static/fixed priority scheduler (in which priorities are assigned 16 | ahead of time), a dynamic priority scheduler (in which priorities are assigned 17 | at runtime), and a mixed scheduling algorithm. 18 | 19 | ## Fixed Priority Scheduling Algorithm 20 | First, a few definitions: 21 | 22 | - The **deadline** of a task is the time at which the next request is issued. 23 | - An **overflow** occurs at time $t$ if $t$ is the deadline for an unfulfilled 24 | task. 25 | - A schedule is **feasible** if there is no overflow. 26 | - The response time of a task is the time between the task's request and the 27 | task's finish time. 28 | - A **critical instant** for task $t$ is the instant where $t$ has the highest 29 | response time. 30 | 31 | It can be shown that the critical instant for any task occurs when the task is 32 | requested simultaneously with all higher priority tasks. This result lets us 33 | easily determine if a feasible fixed priority schedule exists by 34 | pessimistically assuming all tasks are scheduled at their critical instant. 35 | 36 | It also suggests that given two tasks with periodicities $T_1$ and $T_2$ where 37 | $T_1 < T_2$, we should give higher priority to the shorter task with period 38 | $T_1$. This leads to the **rate-monotonic priority scheduling algorithm** 39 | where we assign higher priorities to shorter tasks. A feasible static schedule 40 | exists if and only if a feasible rate-monotonic scheduling algorithm exists. 41 | 42 | Define **processor utilization** to be the fraction of time the processor 43 | spends running tasks. 
We say a set of tasks **fully utilize** the processor if 44 | there exists a feasible schedule for them, but increasing the running time of 45 | any of the tasks implies there is no feasible schedule. The least upper bound 46 | on processor utilization is the minimum processor utilization for tasks that 47 | fully utilize the processor. For $m$ tasks, the least upper bound is $m(2^{1/m} 48 | - 1)$ which approaches $\ln(2)$ for large $m$. 49 | 50 | ## Deadline Driven Scheduling Algorithm 51 | The **deadline driven scheduling algorithm** (or earliest deadline first 52 | scheduling algorithm) dynamically assigns the highest priority to the task with 53 | the most imminent deadline. This scheduling algorithm has a least upper bound 54 | of 100% processor utilization. Moreover, if any feasible schedule exists for a 55 | set of tasks, a feasible deadline driven schedule exists. 56 | 57 | ## Mixed Scheduling Algorithm. 58 | Scheduling hardware (at the time) resembled a fixed priority scheduler, but a 59 | dynamic scheduler could be implemented for less frequent tasks. A hybrid 60 | scheduling algorithm scheduled the $k$ most frequent tasks using the 61 | rate-monotonic scheduling algorithm and scheduled the rest using the deadline 62 | driven algorithm. 63 | 64 | 67 | -------------------------------------------------------------------------------- /papers/mckeen2013innovative.md: -------------------------------------------------------------------------------- 1 | ## [Innovative Instructions and Software Model for Isolated Execution (2013)](https://scholar.google.com/scholar?cluster=11948934428694485446&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Applications are responsible for managing an increasing amount of sensitive 4 | information. Intel SGX is a set of new instructions and memory access changes 5 | that allow users to put code and data into secured *enclaves* that are 6 | inaccessible even to privileged code. The enclaves provide confidentiality, 7 | integrity, and isolation. 8 | 9 | A process' virtual memory space is divided into different sections. There's a 10 | section for the code, a section for the stack, a section for the heap, etc. An 11 | enclave is just another region in the user's address space, except the enclave 12 | has some special properties. The enclave can store code and data. When a 13 | process is running code in the enclave, it can access data in the enclave. 14 | Otherwise, the enclave data is off limits. 15 | 16 | Each enclave is composed of a set of pages, and these pages are stored in the 17 | *Enclave Page Cache (EPC)*. 18 | 19 | +---------------------------------+ 20 | | .-----. .-----. .-----. .-----. | 21 | | | \| | \| | \| | \| | 22 | | | | | | | | | | | EPC 23 | | | | | | | | | | | 24 | | '.....' '.....' '.....' '.....' | 25 | +---------------------------------+ 26 | 27 | In addition to storing enclave pages, the EPC also stores SGX structures (I 28 | guess Enclave Page and SGX Structures Cache (EPSSC) was too long of an 29 | acronym). The EPC is protected from hardware and software access. A related 30 | structure, the *Enclave Page Cache Map* (EPCM), stores a piece of metadata for 31 | each active page in the EPC. Moreover, each enclave is assigned a *SGX Enclave 32 | Control Store* (SECS). There are instructions to create, add pages to, secure, 33 | enter, and exit an enclave. 34 | 35 | Computers have a finite, nay scarce amount of memory. In order to allow as many 36 | processes to operate with this scarce resource, operating systems implement 37 | paging. 
Active pages of memory are stored in memory, while inactive pages are 38 | flushed to the disk. Analogously, in order to allow as many processes to use 39 | enclaves as possible, SGX allows for pages in the EPC to be paged to main 40 | memory. The difficulty is that the operating system is not trusted and neither 41 | is main memory. 42 | 43 | +---------------------------------+ 44 | | .-----. .-----. .-----. | 45 | | | \| | \| | \| | 46 | | | VA | | ^ | | | | | EPC (small and trusted) 47 | | | | | | | | | | | 48 | | '.....' '..|..' | '.....' | 49 | +------------|-------|------------+ 50 | | | (paging) 51 | +------------|-------|----------------------- 52 | | .-----. | .--|--. .-----. .-----. 53 | | | \| | | | \| | \| | \| 54 | | | | | | v | | | | | ... main memory (big but not trusted) 55 | | | | | | | | | | 56 | | '.....' '.....' '.....' '.....' 57 | +-------------------------------------------- 58 | 59 | In order to page an EPC page to main memory, all cached translations that point 60 | to it must first be cleared. Then, the page is encrypted. A nonce, called a 61 | version, is created for the page and put into a special *Version Array* (VA) 62 | page in the EPC. A MAC is taken of the encrypted contents, the version, and the 63 | page's metadata and is stored with the file in main memory. When the page is 64 | paged back into the EPC, the MAC is checked against the version in the VA 65 | before the VA is cleared. 66 | -------------------------------------------------------------------------------- /papers/mckusick1984fast.md: -------------------------------------------------------------------------------- 1 | # [A Fast File System for UNIX (1984)](https://scholar.google.com/scholar?cluster=1900924654174602790) 2 | The **Fast Filesystem** (FFS) improved the read and write throughput of the 3 | original Unix file system by 10x by 4 | 5 | 1. increasing the block size, 6 | 2. dividing blocks into fragments, and 7 | 3. performing smarter allocation. 8 | 9 | The original Unix file system, dubbed "the old file system", divided disk 10 | drives into partitions and loaded a file system on to each partition. The 11 | filesystem included a superblock containing metadata, a linked list of free 12 | data blocks known as the **free list**, and an **inode** for every file. 13 | Notably, the file system was composed of **512 byte** blocks; no more than 512 14 | bytes could be transfered from the disk at once. Moreover, the file system had 15 | poor data locality. Files were often sprayed across the disk requiring lots of 16 | random disk accesses. 17 | 18 | The "new file system" improved performance by increasing the block size to any 19 | power of two at least as big as **4096 bytes**. In order to handle small files 20 | efficiently and avoid high internal fragmentation and wasted space, blocks were 21 | further divided into **fragments** at least as large as the disk sector size. 22 | 23 | ``` 24 | +------------+------------+------------+------------+ 25 | block | fragment 1 | fragment 2 | fragment 3 | fragment 4 | 26 | +------------+------------+------------+------------+ 27 | ``` 28 | 29 | Files would occupy as many complete blocks as possible before populating at 30 | most one fragmented block. 31 | 32 | Data was also divided into **cylinder groups** where each cylinder group included 33 | a copy of the superblock, a list of inodes, a bitmap of available blocks (as 34 | opposed to a free list), some usage statistics, and finally data blocks. 
The 35 | file system took advantage of hardware specific information to place data at 36 | rotational offsets specific to the hardware so that files could be read with as 37 | little delay as possible. Care was also taken to allocate files contiguously, 38 | similar files in the same cylinder group, and all the inodes in a directory 39 | together. Moreover, if the amount of available space gets too low, then it 40 | becomes more and more difficult to allocate blocks efficiently. For example, it 41 | becomes hard to allocate the blocks of a file contiguously. Thus, the system 42 | always tries to keep ~10% of the disk free. 43 | 44 | Allocation is also improved in the FFS. A top level global policy uses file 45 | system wide information to decide where to put new files. Then, a local policy 46 | places the blocks. Care must be taken to colocate blocks that are accessed 47 | together, but crowding a single cylinder group can exhaust its resources. 48 | 49 | In addition to performance improvements, FFS also introduced 50 | 51 | 1. longer filenames, 52 | 2. advisory file locks, 53 | 3. soft links, 54 | 4. atomic file renaming, and 55 | 5. disk quota enforcement. 56 | -------------------------------------------------------------------------------- /papers/moore2006inferring.md: -------------------------------------------------------------------------------- 1 | ## [Inferring Internet Denial-of-Service Activity (2001)](TODO) ## 2 | **Summary.** 3 | This paper uses *backscatter analysis* to quantitatively analyze 4 | denial-of-service attacks on the Internet. Most flooding denial-of-service 5 | attacks involve IP spoofing, where each packet in an attack is given a faux IP 6 | address drawn uniformly at random from the space of all IP addresses. If each 7 | packet elicits a reply packet from the victim, then victims of 8 | denial-of-service attacks end up sending unsolicited messages to servers 9 | uniformly at random. By monitoring this *backscatter* at enough hosts, one can 10 | infer the number, intensity, and type of denial-of-service attacks. 11 | 12 | There are of course a number of assumptions upon which backscatter depends. 13 | 14 | 1. *Address uniformity*. It is assumed that DOS attackers spoof IP addresses 15 | uniformly at random. 16 | 2. *Reliable delivery*. It is assumed that packets, as part of the attack and 17 | response, are delivered reliably. 18 | 3. *Backscatter hypothesis*. It is assumed that unsolicited packets arriving at 19 | a host are actually part of backscatter. 20 | 21 | The paper performs a backscatter analysis on 1/256 of the IPv4 address space. 22 | They cluster the backscatter data using a *flow-based classification* to 23 | measure individual attacks and using an *event-based classification* to measure 24 | the intensity of attacks. The findings of the analysis are best summarized by 25 | the paper. 26 | -------------------------------------------------------------------------------- /papers/prabhakaran2005analysis.md: -------------------------------------------------------------------------------- 1 | ## [Analysis and Evolution of Journaling File Systems (2005)](TODO) ## 2 | **Summary.** 3 | The authors develop and apply two file system analysis techniques dubbed 4 | *Semantic Block-Level Analysis* (SBA) and *Semantic Trace Playback* (STP) to 5 | four journaled file systems: ext3, ReiserFS, JFS, and NTFS. 6 | 7 | - Benchmarking a file system can tell you for *which* workloads it is fast and 8 | for which it is slow.
But, these benchmarks don't tell you *why* the file 9 | system performs the way it does. By leveraging semantic information about 10 | block traces, SBA aims to identify the cause of file system behavior. 11 | 12 | Users install an SBA driver into the OS and mount the file system of interest 13 | on to the SBA driver. The interposed driver intercepts and logs all 14 | block-level requests and responses to and from the disk. Moreover, the SBA 15 | driver is specialized to each file system under consideration so that it can 16 | interpret each block operation, categorizing it as a read/write to a journal 17 | block or regular data block. Implementing an SBA driver is easy to do, 18 | guarantees that no operation goes unlogged, and has low overhead. A toy sketch of this classification step appears at the end of this summary. 19 | - Deciding the effectiveness of new file system policies is onerous. For 20 | example, to evaluate a new journaling scheme, you would traditionally have to 21 | implement the new scheme and evaluate it on a set of benchmarks. If it 22 | performs well, you keep the changes; otherwise, you throw them away. STP uses 23 | block traces to perform a light-weight simulation to analyze new file system 24 | policies without implementation overhead. 25 | 26 | STP is a user-level process that reads in block traces produced by SBA and 27 | file system operation logs and issues direct I/O requests to the disk. It can 28 | then be used to evaluate small, simple modifications to existing file 29 | systems. For example, it can be used to evaluate the effects of moving the 30 | journal from the beginning of the file system to the middle of the file 31 | system. 32 | 33 | The authors spend the majority of the paper examining *ext3*: the third 34 | extended file system. ext3 introduces journaling to ext2, and ext2 resembles 35 | the [Unix FFS](#a-fast-file-system-for-unix-1984) with partitions divided into 36 | groups each of which contains bitmaps, inodes, and regular data. ext3 comes 37 | with three journaling modes: 38 | 39 | 1. Using *writeback* journaling, metadata is journaled and data is 40 | asynchronously written to disk. This has the weakest consistency guarantees. 41 | 2. Using *ordered* journaling, data is written to disk before its associated 42 | metadata is journaled. 43 | 3. Using *data* journaling, both data and metadata are journaled before being 44 | *checkpointed*: copied from the journal to the disk. 45 | 46 | Moreover, operations are grouped into *compound transactions* and issued in 47 | batch. ext3 SBA analysis led to the following conclusions: 48 | 49 | - The fastest journaling mode depends heavily on the workload. 50 | - Compound transactions can lead to *tangled synchrony* in which asynchronous 51 | operations are made synchronous when placed in a transaction with synchronous 52 | operations. 53 | - In ordered journaling, ext3 doesn't concurrently write to the journal and the 54 | disk. 55 | 56 | STP was also used to analyze the effects of 57 | 58 | - Journal position in the disk. 59 | - An adaptive journaling mode that dynamically chooses between ordered and data 60 | journaling. 61 | - Untangling compound transactions. 62 | - Data journaling in which diffs are journaled instead of whole blocks. 63 | 64 | SBA and STP were also applied to ReiserFS, JFS, and NTFS.
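To make the SBA idea concrete, here is a minimal Python sketch of the block classification step. It is not the authors' driver: the trace format and the journal block range are invented for illustration, and a real SBA driver understands many more block types (superblock, inode, bitmap, and directory blocks) for each file system.

```
# Toy SBA-style classifier (not the authors' driver). Assumes a made-up trace
# of (operation, block number) pairs and a known, hypothetical journal region.
JOURNAL_BLOCKS = range(1_000, 5_000)

def classify(trace):
    """Count journal writes, in-place (fixed-location) writes, and reads."""
    counts = {"journal_write": 0, "fixed_write": 0, "read": 0}
    for op, block in trace:
        if op == "read":
            counts["read"] += 1
        elif block in JOURNAL_BLOCKS:
            counts["journal_write"] += 1
        else:
            counts["fixed_write"] += 1
    return counts

# Under data journaling, most blocks are written twice: once to the journal
# and again when checkpointed to their home location.
trace = [("write", 1200), ("write", 1201), ("write", 8000),
         ("write", 1202), ("write", 8001), ("read", 8000)]
print(classify(trace))  # {'journal_write': 3, 'fixed_write': 2, 'read': 1}
```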
65 | -------------------------------------------------------------------------------- /papers/quigley2009ros.md: -------------------------------------------------------------------------------- 1 | # [ROS: an open-source Robot Operating System (2009)](https://scholar.google.com/scholar?cluster=143767492575573826) 2 | Writing code that runs on a robot is a *very* challenging task. Hardware varies 3 | from robot to robot, and the software required to perform certain tasks (e.g. 4 | picking up objects) can require an enormous amount of code (e.g. low-level 5 | drivers, object detection, motion planning, etc.). ROS, the Robot Operating 6 | System, is a framework for writing and managing distributed systems that run on 7 | robots. Note that ROS is not an operating system as the name suggests. 8 | 9 | ## Nomenclature 10 | A **node** is a process (or software module) that performs computation. Nodes 11 | communicate by sending **messages** (like protocol buffers) to one another. 12 | Nodes can publish messages to a **topic** or subscribe to a topic to receive 13 | messages. ROS also provides **services** (i.e. RPCs) which are defined by a 14 | service name, a request message type, and a response message type. A minimal publisher/subscriber sketch appears after the design goals below. 15 | 16 | ## What is ROS? 17 | In short, ROS provides the following core functionality. 18 | 19 | - ROS provides a messaging format similar to protocol buffers. Programmers 20 | define messages using a ROS IDL and a compiler generates code in various 21 | languages (e.g. C++, Octave, LISP). Processes running on robots then send 22 | messages to one another using XML RPC. 23 | - ROS provides command line tools to debug or alter the execution of 24 | distributed systems. For example, one command line tool can be used to log a 25 | message stream to disk without having to change any source code. These logged 26 | messages can then be replayed to develop and test other modules. Other 27 | command line tools are described below. 28 | - ROS organizes code into **packages**. A package is just a directory of code 29 | (or data or anything really) with some associated metadata. Packages are 30 | organized into **repositories** which are trees of directories with packages 31 | at the leaves. ROS provides a package manager to query repositories, download 32 | packages, and build code. 33 | 34 | ## Design Goals 35 | The following design goals motivate ROS' design. 36 | 37 | - **Peer-to-peer.** Because ROS systems are running on interconnected robots, 38 | it makes sense for systems to be written in a peer-to-peer fashion. For 39 | example, imagine two clusters of networked robots that communicate over a slow 40 | wireless link. Running a master on either of the clusters will slow down the 41 | system. 42 | - **Multilingual.** ROS wants to support multiple programming languages which 43 | motivated its protobuf-like message format. 44 | - **Tools-based.** Instead of being a monolithic code base, ROS includes a 45 | disaggregated set of command line tools. 46 | - **Thin.** Many robot algorithms developed by researchers *could* be re-used 47 | but often aren't because they become tightly coupled with the researcher's 48 | specific environment. ROS encourages algorithms to be developed agnostic to ROS 49 | so they can be easily re-used. 50 | - **Free and Open-Source.** By being free and open-source, ROS is easier to 51 | debug and encourages collaboration between lots of researchers.
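To ground the node/topic/message vocabulary above, here is a hedged sketch written against rospy, the ROS 1 Python client library. The exact signatures follow a later rospy release rather than the 2009 API, and the `chatter` topic and one-message-per-second workload are invented for illustration.

```
#!/usr/bin/env python
# Two logical nodes: a talker that publishes to the "chatter" topic and a
# listener that subscribes to it. Run each function in its own process.
import rospy
from std_msgs.msg import String

def talker():
    rospy.init_node("talker")
    pub = rospy.Publisher("chatter", String, queue_size=10)
    rate = rospy.Rate(1)  # publish once per second
    while not rospy.is_shutdown():
        pub.publish(String(data="hello"))
        rate.sleep()

def listener():
    rospy.init_node("listener")
    rospy.Subscriber("chatter", String, lambda msg: rospy.loginfo(msg.data))
    rospy.spin()  # hand control to rospy's callback loop

if __name__ == "__main__":
    talker()  # or listener(), depending on which node this process should be
```

Because the two nodes only agree on a topic name and a message type, the listener can be logged, replayed, or swapped out without touching the talker, which is exactly what the tooling use cases below rely on.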
52 | 53 | ## Use Cases 54 | - **Debugging a single node.** Because ROS systems are loosely coupled modules 55 | communicating via RPC, one module can be debugged against other already 56 | debugged modules. 57 | - **Logging and playback.** As mentioned above, message streams can be 58 | transparently logged to disk for future replay. 59 | - **Packaged subsystems.** Programmers can describe the structure of a 60 | distributed system, and ROS can launch the system across multiple hosts. 61 | - **Collaborative development.** ROS' packages and repositories encourage 62 | collaboration. 63 | - **Visualization and monitoring.** Message streams can be intercepted and 64 | visualized over time. Subscribed streams can also be filtered by expressions 65 | before being visualized. 66 | - **Composition of functionality.** Namespaces can be used to launch the same 67 | system multiple times. 68 | -------------------------------------------------------------------------------- /papers/ritchie1978unix.md: -------------------------------------------------------------------------------- 1 | # [The Unix Time-Sharing System (1974)](https://scholar.google.com/scholar?cluster=2132419950152599605&hl=en&as_sdt=0,5) 2 | Unix was an operating system developed by Dennis Ritchie, Ken Thompson, and 3 | others at Bell Labs. It was the successor to Multics and is probably the single 4 | most influential piece of software ever written. 5 | 6 | Earlier versions of Unix were written in assembly, but the project was later 7 | ported to C: probably the single most influential programming language ever 8 | developed. This resulted in a 1/3 increase in size, but the code was much more 9 | readable and the system included new features, so it was deemed worth it. 10 | 11 | The most important feature of Unix was its file system. Ordinary files were 12 | simple arrays of bytes physically stored as 512-byte blocks: a rather simple 13 | design. Each file was given an inumber: an index into an ilist of inodes. Each 14 | inode contained metadata about the file and pointers to the actual data of the 15 | file in the form of direct and indirect blocks. This representation made it 16 | easy to support (hard) linking. Each file was protected with 9 bits: the same 17 | protection model Linux uses today. Directories were themselves files which 18 | stored mappings from filenames to inumbers. Devices were modeled simply as 19 | files in the `/dev` directory. This unifying abstraction allowed devices to be 20 | accessed with the same API. File systems could be mounted using the `mount` 21 | command. Notably, Unix didn't support user level locking, as it was neither 22 | necessary nor sufficient. 23 | 24 | Processes in Unix could be created using a fork followed by an exec, and 25 | processes could communicate with one another using pipes. The shell was nothing 26 | more than an ordinary process. Unix included file redirection, pipes, and the 27 | ability to run programs in the background. All this was implemented using fork, 28 | exec, wait, and pipes, as sketched below. 29 | 30 | Unix also supported signals.
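As a concrete illustration of that fork/exec/pipe pattern, here is a minimal sketch using Python's `os` module as a stand-in for the C system-call interface; it wires up the equivalent of the shell pipeline `ls | wc -l`. The example is ours, not from the paper, and it runs only on Unix-like systems.

```
# Build `ls | wc -l` by hand, the way a shell does: pipe, fork, exec, wait.
import os

r, w = os.pipe()                      # pipe(): read end and write end

if os.fork() == 0:                    # child 1: producer
    os.dup2(w, 1)                     # redirect stdout into the pipe
    os.close(r); os.close(w)
    os.execvp("ls", ["ls"])           # exec replaces the child's image

if os.fork() == 0:                    # child 2: consumer
    os.dup2(r, 0)                     # read stdin from the pipe
    os.close(r); os.close(w)
    os.execvp("wc", ["wc", "-l"])

os.close(r); os.close(w)              # parent keeps no pipe ends open
os.wait(); os.wait()                  # wait for both children, like the shell
```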
31 | -------------------------------------------------------------------------------- /papers/saltzer1984end.md: -------------------------------------------------------------------------------- 1 | # [End-to-End Arguments in System Design (1984)](https://scholar.google.com/scholar?cluster=9463646641349983499) 2 | This paper presents the **end-to-end argument**: 3 | 4 | > The function in question can completely and correctly be implemented only 5 | > with the knowledge and help of the application standing at the end points of 6 | > the communication system. Therefore, providing that questioned function as a 7 | > feature of the communication system itself is not possible. (Sometimes an 8 | > incomplete version of the function provided by the communication system may 9 | > be useful as a performance enhancement.) 10 | 11 | which says that in a layered system, functionality should, nay must be 12 | implemented as close to the application as possible to ensure correctness (and 13 | usually also performance). 14 | 15 | The end-to-end argument is motivated by an example file transfer scenario in 16 | which host A transfers a file to host B. Every step of the file transfer 17 | presents an opportunity for failure. For example, the disk may silently corrupt 18 | data or the network may reorder or drop packets. Any attempt by one of these 19 | subsystems to ensure reliable delivery is wasted effort since the delivery may 20 | still fail in another subsystem. The only way to guarantee correctness is to 21 | have the file transfer application check for correct delivery itself. For 22 | example, once it receives the entire file, it can send the file's checksum back 23 | to host A to confirm correct delivery. 24 | 25 | In addition to being necessary for correctness, applying the end-to-end 26 | argument also usually leads to improved performance. When a functionality is 27 | implemented in a lower level subsystem, every application built on it must pay 28 | the cost, even if it does not require the functionality. 29 | 30 | There are numerous other examples of the end-to-end argument: 31 | 32 | - Guaranteed packet delivery. 33 | - Secure data transmission. 34 | - Duplicate message suppression. 35 | - FIFO delivery. 36 | - Transaction management. 37 | - RISC. 38 | 39 | The end-to-end argument is not a hard and fast rule. In particular, it may be 40 | eschewed when implementing a functionality in a lower level can lead to 41 | performance improvements. Consider again the file transfer protocol above and 42 | assume the network drops one in every 100 packets. As the file becomes longer, 43 | the odds of a successful delivery become increasingly small making it 44 | prohibitively expensive for the application alone to ensure reliable delivery. 45 | The network may be able to perform a small amount of work to help guarantee 46 | reliable delivery making the file transfer more efficient. 47 | -------------------------------------------------------------------------------- /papers/stonebraker1987design.md: -------------------------------------------------------------------------------- 1 | # [The Design of the POSTGRES Storage System (1987)](https://scholar.google.com/scholar?cluster=6675294870941893293) 2 | POSTGRES, the ancestor of PostgreSQL, employed a storage system with three 3 | interesting characteristics: 4 | 5 | 1. No write-ahead logging (WAL) was used. In fact, there was no recovery code 6 | at all. 7 | 2. The entire database history was recorded and archived. 
Updates were 8 | converted to insertions of new record versions, and data could be queried arbitrarily far in the past. 9 | 3. The system was designed as a collection of asynchronous processes, rather 10 | than a monolithic piece of code. 11 | 12 | Transactions were sequentially assigned 40-bit transaction identifiers (XID) 13 | starting from 0. Each operation in a transaction was sequentially assigned a 14 | command identifier (CID). Together the XID and CID formed a 48-bit interaction 15 | identifier (IID). Each IID was also assigned a two-bit transaction status and 16 | all IIDs were stored in a transaction log with a most recent **tail** of 17 | uncommitted transactions and a **body** of completed transactions. 18 | 19 | Every tuple in a relation was annotated with 20 | 21 | - a record id, 22 | - a min XID, CID, and timestamp, 23 | - a max XID, CID and timestamp, and 24 | - a forward pointer. 25 | 26 | The min values were associated with the transaction that created the record, 27 | and the max values were associated with the transaction that updated the 28 | record. When a record was updated, a new tuple was allocated with the same 29 | record id but updated min values, max values, and forward pointers. The new 30 | tuples were stored as diffs; the original tuple was the **anchor point**; and 31 | the forward pointers chained together the anchor point with its diffs. A toy sketch of this no-overwrite versioning appears at the end of this summary. 32 | 33 | Data could be queried at a particular timestamp or in a range of timestamps. 34 | Moreover, the min and max values of the records could be extracted allowing for 35 | queries like this: 36 | 37 | SELECT Employee.min_timestamp, Employee.max_timestamp, Employee.id 38 | FROM Employee[1 day ago, now] 39 | WHERE Employee.Salary > 10,000 40 | 41 | The timestamp of a transaction was not assigned when the transaction began. 42 | Instead, the timestamps were maintained in a TIME relation, and the timestamps 43 | in the records were left empty and asynchronously filled in. Upon creation, 44 | relations could be annotated as 45 | 46 | - **no archive** in which case timestamps were never filled in, 47 | - **light archive** in which timestamps were read from a TIME relation, or 48 | - **heavy archive** in which timestamps were lazily copied from the TIME 49 | relation into the records. 50 | 51 | POSTGRES allowed for any number of indexes. The type of index (e.g. B-tree) and 52 | the operations that the index efficiently supported were explicitly set by the 53 | user. 54 | 55 | A **vacuum cleaner** process would, by instruction of the user, vacuum records 56 | stored on disk to an archival storage (e.g. WORM device). The archived data was 57 | allowed to have a different set of indexes. The vacuum cleaning proceeded in 58 | three steps: 59 | 60 | 1. Data was archived and archive indexes were formed. 61 | 2. Anchor points were updated in the database. 62 | 3. Archived data space was reclaimed. 63 | 64 | The system could crash during this process which could lead to duplicate 65 | entries, but nothing more nefarious. The consistency guarantees were a bit weak 66 | compared to today's standards. Some crashes could lead to slowly accumulating 67 | un-reclaimed space. 68 | 69 | Archived data could be indexed by values and by time ranges efficiently using 70 | R-trees. Multi-media indexes which spanned the disk and archive were also 71 | supported.
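The no-overwrite versioning described above can be sketched in a few lines of Python. This is a toy model, not POSTGRES: real records carry XIDs, CIDs, and lazily filled timestamps, and new versions are stored as diffs off an anchor point rather than as full copies. Here, hypothetical `t_min`/`t_max` fields bound each version's validity interval.

```
# Toy no-overwrite store: an update closes the old version's interval and
# appends a new version, so reads can be evaluated "as of" any past time.
import time

class VersionedRelation:
    def __init__(self):
        self.versions = {}   # record id -> list of (t_min, t_max, value)

    def put(self, rid, value, now=None):
        now = now or time.time()
        chain = self.versions.setdefault(rid, [])
        if chain:
            t_min, _, old = chain[-1]
            chain[-1] = (t_min, now, old)      # close the current version
        chain.append((now, None, value))       # append the new version

    def get(self, rid, as_of=None):
        as_of = as_of or time.time()
        for t_min, t_max, value in self.versions.get(rid, []):
            if t_min <= as_of and (t_max is None or as_of < t_max):
                return value
        return None

rel = VersionedRelation()
rel.put("emp1", {"salary": 9000}, now=1.0)
rel.put("emp1", {"salary": 12000}, now=2.0)
print(rel.get("emp1", as_of=1.5))   # {'salary': 9000} -- time travel
print(rel.get("emp1"))              # {'salary': 12000} -- current version
```

In this model, a vacuum cleaner would simply move closed versions (those with a non-empty `t_max`) out of the live store and into archival storage.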
72 | -------------------------------------------------------------------------------- /papers/vavilapalli2013apache.md: -------------------------------------------------------------------------------- 1 | ## [Apache Hadoop YARN: Yet Another Resource Negotiator (2013)](https://scholar.google.com/scholar?cluster=3355598125951377731&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Hadoop began as a MapReduce clone designed for large scale web crawling. As big 4 | data became trendy and data became... big, Hadoop became the de facto 5 | standard data processing system, and large Hadoop clusters were installed in 6 | many companies as "the" cluster. As application requirements evolved, users 7 | started abusing the large Hadoop clusters in unintended ways. For example, users would 8 | submit map-only jobs which were thinly disguised web servers. Apache Hadoop YARN 9 | is a cluster manager that aims to disentangle cluster management from the 10 | programming paradigm and has the following goals: 11 | 12 | - Scalability 13 | - Multi-tenancy 14 | - Serviceability 15 | - Locality awareness 16 | - High cluster utilization 17 | - Reliability/availability 18 | - Secure and auditable operation 19 | - Support for programming model diversity 20 | - Flexible resource model 21 | - Backward compatibility 22 | 23 | YARN is orchestrated by a per-cluster *Resource Manager* (RM) that tracks 24 | resource usage and node liveness, enforces allocation invariants, and 25 | arbitrates contention among tenants. *Application Masters* (AM) are responsible 26 | for negotiating resources with the RM and managing the execution of a single job. 27 | AMs send ResourceRequests to the RM telling it resource requirements, locality 28 | preferences, etc. In return, the RM hands out *containers* (e.g. <2GB RAM, 1 29 | CPU>) to AMs. The RM also communicates with Node Managers (NM) running on each 30 | node which are responsible for measuring node resources and managing (i.e. 31 | starting and killing) tasks. When a user wants to submit a job, they send it to 32 | the RM, which hands a capability to an AM to present to an NM. The RM is a 33 | single point of failure. If it fails, it restores its state from disk and kills 34 | all running AMs. The AMs are trusted to be fault-tolerant and resubmit any 35 | prematurely terminated jobs. 36 | 37 | YARN is deployed at Yahoo where it manages roughly 500,000 daily jobs. YARN 38 | supports frameworks like Hadoop, Tez, Spark, Dryad, Giraph, and Storm. 39 | 40 | -------------------------------------------------------------------------------- /papers/verma2015large.md: -------------------------------------------------------------------------------- 1 | ## [Large-scale cluster management at Google with Borg (2015)](https://scholar.google.com/scholar?cluster=18268680833362692042&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Borg is Google's cluster manager. Users submit *jobs*, collections of *tasks*, 4 | to Borg; each job runs in a single *cell*, and many cells live inside a 5 | single *cluster*. Borg jobs are either high priority latency-sensitive 6 | *production* jobs (e.g. user facing products and core infrastructure) or low 7 | priority *non-production* batch jobs. Jobs have typical properties like name 8 | and owner and can also express constraints (e.g. only run on certain 9 | architectures). Tasks also have properties and state their resource demands. 10 | Borg jobs are specified in BCL and are bundled as statically linked 11 | executables. Jobs are labeled with a priority and must operate within quota 12 | limits.
Resources are bundled into *allocs* in which multiple tasks can run. 13 | Borg also manages a naming service, and exports a UI called Sigma to 14 | developers. 15 | 16 | Cells are managed by five-way replicated *Borgmasters*. A Borgmaster 17 | communicates with *Borglets* running on each machine via RPC, manages the Paxos 18 | replicated state of system, and exports information to Sigma. There is also a 19 | high fidelity borgmaster simulator known as the Fauxmaster which can used for 20 | debugging. 21 | 22 | One subcomponent of the Borgmaster handles scheduling. Submitted jobs are 23 | placed in a queue and scheduled by priority and round-robin within a priority. 24 | Each job undergoes feasibility checking where Borg checks that there are enough 25 | resources to run the job and then scoring where Borg determines the best place 26 | to run the job. Worst fit scheduling spreads jobs across many machines allowing 27 | for spikes in resource usage. Best fit crams jobs as closely as possible which 28 | is bad for bursty loads. Borg uses a scheduler which attempts to limit 29 | "stranded resources": resources on a machine which cannot be used because other 30 | resources on the same machine are depleted. Tasks that are preempted are placed 31 | back on the queue. Borg also tries to place jobs where their packages are 32 | already loaded, but offers no other form of locality. 33 | 34 | Borglets run on each machine and are responsible for starting and stopping 35 | tasks, managing logs, and reporting to the Borgmaster. The Borgmaster 36 | periodically polls the Borglets (as opposed to Borglets pushing to the 37 | Borgmaster) to avoid any need for flow control or recovery storms. 38 | 39 | The Borgmaster performs a couple of tricks to achieve high scalability. 40 | 41 | - The scheduler operates on slightly stale state, a form of "optimistic 42 | scheduling". 43 | - The Borgmaster caches job scores. 44 | - The Borgmaster performs feasibility checking and scoring for all equivalent 45 | jobs at once. 46 | - Complete scoring is hard, so the Borgmaster uses randomization. 47 | 48 | The Borgmaster puts the onus of fault tolerance on applications, expecting them 49 | to handle occasional failures. Still, the Borgmaster also performs a set of 50 | nice tricks for availability. 51 | 52 | - It reschedules evicted tasks. 53 | - It spreads tasks across failure domains. 54 | - It limits the number of tasks in a job that can be taken down due to 55 | maintenance. 56 | - Avoids past machine/task pairings that lead to failure. 57 | 58 | To measure cluster utilization, Google uses a *cell compaction* metric: the 59 | smallest a cell can be to run a given workload. Better utilization leads 60 | directly to savings in money, so Borg is very focused on improving utilization. 61 | For example, it allows non-production jobs to reclaim unused resources from 62 | production jobs. 63 | 64 | Borg uses containers for isolation. It also makes sure to throttle or kill jobs 65 | appropriately to ensure performance isolation. 66 | 67 | -------------------------------------------------------------------------------- /papers/vogels2009eventually.md: -------------------------------------------------------------------------------- 1 | # [Eventually Consistent (2009)](https://scholar.google.com/scholar?cluster=4308857796184904369) 2 | In this CACM article, Werner Vogels discusses eventual consistency as well as a 3 | couple other forms of consistency. 
4 | 5 | ## Historical Perspective 6 | Back in the day, when people were thinking about how to build distributed 7 | systems, they thought that strong consistency was the only option. Anything 8 | else would just be incorrect, right? Well, fast forward to the 90's and 9 | availability started being favored over consistency. In 2000, Brewer released 10 | the CAP theorem unto the world, and weak consistency really took off. 11 | 12 | ## Client's Perspective of Consistency 13 | There is a zoo of consistency from the perspective of a client. 14 | 15 | - **Strong consistency.** A data store is strongly consistent if after a write 16 | completes, it is visible to all subsequent reads. 17 | - **Weak consistency.** Weak consistency is a synonym for no consistency. Your 18 | reads can return garbage. 19 | - **Eventual consistency.** "the storage system guarantees that if no new 20 | updates are made to the object, eventually all accesses will return the last 21 | updated value." 22 | - **Causal consistency.** If A issues a write and then contacts B, B's read 23 | will see the effect of A's write since it causally comes after it. Causally 24 | unrelated reads can return whatever they want. 25 | - **Read-your-writes consistency.** Clients read their most recent write or a 26 | more recent version. 27 | - **Session consistency.** So long as a client maintains a session, it gets 28 | read-your-writes consistency. 29 | - **Monotonic read consistency.** Over time, a client will see increasingly 30 | more fresh versions of data. 31 | - **Monotonic write consistency.** A client's writes will be executed in the 32 | order in which they are issued. 33 | 34 | ## Server's Perspective of Consistency 35 | Systems can implement consistency using quorums in which a write is sent to W 36 | of the N replicas, and a read is sent to R of the N replicas. If R + W > N, 37 | then we have strong consistency. 38 | -------------------------------------------------------------------------------- /papers/welsh2001seda.md: -------------------------------------------------------------------------------- 1 | ## [SEDA: An Architecture for Well-Conditioned, Scalable Internet Services (2001)](https://goo.gl/wrn04s) ## 2 | **Summary.** 3 | Writing distributed Internet applications is difficult because they have to 4 | serve a huge number of requests, and the number of requests is highly variable 5 | and bursty. Moreover, the applications themselves are also complicated pieces 6 | of code. This paper introduces the *staged event-driven architecture* (SEDA) 7 | which has the following goals. 8 | 9 | - *Massive concurrency.* Applications written using SEDA should be able to 10 | support a very large number of clients. 11 | - *Simplify the construction of well-conditioned services.* A 12 | *well-conditioned* service is one that gracefully degrades. Throughput 13 | increases with the number of clients until it saturates at some threshold. At 14 | this point, throughput remains constant and latency increases proportionally 15 | with the number of clients. SEDA is designed to make writing well-conditioned 16 | services easy. 17 | - *Enable introspection.* SEDA applications should be able to inspect and adapt 18 | to incoming request queues. Some request-per-thread architectures, for 19 | example, do not enable introspection. Control over thread scheduling is left 20 | completely to the OS; it cannot be adapted to the queue of incoming 21 | requests. 
22 | - *Self-tuning resource management.* SEDA programmers should not have to tune 23 | knobs themselves. 24 | 25 | 26 | SEDA accomplishes these goals by structuring applications as a *network* of 27 | *event-driven stages* connected by *explicit message queues* and managed by 28 | *dynamic resource controllers*. That's a dense sentence, so let's elaborate. 29 | 30 | Threading based concurrency models have scalability limitations due to the 31 | overheads of context switching, poor caching, synchronization, etc. The 32 | [event-driven concurrency 33 | model](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-events.pdf) involves 34 | nothing more than a single-threaded loop that reads messages and processes 35 | them. It avoids many of the scalability limitations that threading models face. 36 | In SEDA, the atomic unit of execution is a *stage* and is implemented using an 37 | event-driven concurrency model. Each stage has an input queue of messages which 38 | are read in batches by a thread pool and processed by a user-provided event 39 | handler which can in turn write messages to other stages. 40 | 41 | A SEDA application is simply a network (i.e. graph) of interconnected stages. 42 | Notably, the input queue of every stage is finite. This means that when one 43 | stage tries to write data to the input queue of another stage, it may fail. 44 | When this happens, stages have to block (i.e. pushback), start dropping 45 | requests (i.e. load shedding), degrade service, deliver an error to the user, 46 | etc. 47 | 48 | To ensure SEDA applications are well-conditioned, various resource managers 49 | tune SEDA application parameters to ensure consistent performance. For example, 50 | the *thread pool controller* scales the number of threads within stages based 51 | on the number of messages in its input queue. Similarly, the *batching 52 | controller* adjusts the size of the batch delivered to each event handler. 53 | 54 | The authors developed a SEDA prototype in Java dubbed Sandstorm. As with all 55 | event-driven concurrency models, Sandstorm depends on asynchronous I/O 56 | libraries. It implements asynchronous network I/O as three stages using 57 | existing OS functionality (i.e. select/poll). It implements asynchronous file 58 | I/O using a dynamically resizable thread pool that issues synchronous calls; OS 59 | support for asynchronous file I/O was weak at the time. The authors evaluate 60 | Sandstorm by implementing and evaluating an HTTP server and Gnutella router. 61 | -------------------------------------------------------------------------------- /papers/zaharia2012resilient.md: -------------------------------------------------------------------------------- 1 | # [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012)](TODO) ## 2 | Frameworks like MapReduce made processing large amounts of data easier, but 3 | they did not leverage distributed memory. If a MapReduce was run iteratively, 4 | it would write all of its intermediate state to disk: something that was 5 | prohibitively slow. This limitation made batch processing systems like 6 | MapReduce ill-suited to *iterative* (e.g. k-means clustering) and *interactive* 7 | (e.g. ad-hoc queries) workflows. Other systems like Pregel did take advantage 8 | of distributed memory and reused the in-memory data across computations, but 9 | the systems were not general-purpose. 10 | 11 | Spark uses **Resilient Distributed Datasets** (RDDs) to perform general 12 | computations in memory. 
RDDs are immutable partitioned collections of records. 13 | Unlike pure distributed shared memory abstractions which allow for arbitrary 14 | fine-grained writes, RDDs can only be constructed using coarse-grained 15 | transformations from on-disk data or other RDDs. This weaker abstraction can be 16 | implemented efficiently. Spark also uses RDD lineage to implement low-overhead 17 | fault tolerance. Rather than persist intermediate datasets, the lineage of an 18 | RDD can be persisted and efficiently recomputed. RDDs could also be 19 | checkpointed to avoid the recomputation of a long lineage graph. 20 | 21 | Spark has a Scala-integrated API and comes with a modified interactive 22 | interpreter. It also includes a large number of useful **transformations** 23 | (which construct RDDs) and **actions** (which derive data from RDDs). Users can 24 | also manually specify RDD persistence and partitioning to further improve 25 | performance. 26 | 27 | Spark subsumed a huge number of existing data processing frameworks like 28 | MapReduce and Pregel in a small amount of code. It was also much, much faster 29 | than everything else on a large number of applications. 30 | --------------------------------------------------------------------------------
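As a closing illustration of the transformation/action split and of lineage-based recovery, here is a hedged sketch using the PySpark API; the paper's own API is Scala-integrated, the method names here match later PySpark releases, and `logs.txt` plus the error-counting workload are invented for illustration.

```
# Minimal RDD sketch: transformations are lazy, actions force evaluation, and
# persisted RDDs are kept in memory for reuse across computations.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

lines  = sc.textFile("logs.txt")                     # RDD from on-disk data
errors = lines.filter(lambda l: "ERROR" in l)        # transformation (lazy)
errors.persist()                                     # keep this RDD in memory

print(errors.count())                                # action: forces evaluation
print(errors.map(lambda l: l.split()[0]).take(5))    # reuse the cached RDD

# If a partition of `errors` is lost, Spark recomputes it from the lineage
# (textFile -> filter) rather than restoring it from a checkpoint.
```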