├── .gitignore ├── Makefile ├── README.md ├── TODO.md ├── conferences ├── README.md ├── sigmod2016.md └── tapp.md ├── css ├── default_highlight.css └── style.css ├── footer.html ├── header.html ├── html ├── adya2000generalized.html ├── agarwal1996computation.html ├── agrawal1987concurrency.html ├── akidau2013millwheel.html ├── alvaro2010boom.html ├── alvaro2011consistency.html ├── alvaro2011dedalus.html ├── alvaro2015lineage.html ├── arasu2006cql.html ├── armbrust2015spark.html ├── avnur2000eddies.html ├── bailis2013highly.html ├── bailis2014coordination.html ├── balegas2015putting.html ├── barham2003xen.html ├── baumann2015shielding.html ├── beckmann1990r.html ├── berenson1995critique.html ├── bershad1995spin.html ├── brewer2005combining.html ├── brewer2012cap.html ├── brin1998anatomy.html ├── bugnion1997disco.html ├── burns2016borg.html ├── burrows2006chubby.html ├── cafarella2008webtables.html ├── carriero1994linda.html ├── chamberlin1981history.html ├── chang2008bigtable.html ├── chen2016realtime.html ├── cheney2009provenance.html ├── clark2005live.html ├── codd1970relational.html ├── conway2012logic.html ├── conway2014edelweiss.html ├── corbett2013spanner.html ├── crooks2016tardis.html ├── dan2017using.html ├── dean2008mapreduce.html ├── decandia2007dynamo.html ├── dewitt1990gamma.html ├── dewitt1992parallel.html ├── diaconu2013hekaton.html ├── elphinstone2013l3.html ├── engler1995exokernel.html ├── ghemawat2003google.html ├── ghodsi2011dominant.html ├── gilbert2002brewer.html ├── goldman1997dataguides.html ├── golub1992microkernel.html ├── gonzalez2012powergraph.html ├── gotsman2016cause.html ├── graefe1990encapsulation.html ├── graefe1993volcano.html ├── graefe2009five.html ├── gray1976granularity.html ├── green2013datalog.html ├── halevy2016goods.html ├── harinarayan1996implementing.html ├── hellerstein2007architecture.html ├── hellerstein2010declarative.html ├── hellerstein2012madlib.html ├── herlihy1990linearizability.html ├── hindman2011mesos.html ├── holt2016disciplined.html ├── isard2007dryad.html ├── kandel2011wrangler.html ├── kapritsos2012all.html ├── klonatos2014building.html ├── kohler2000click.html ├── kohler2012declarative.html ├── kornacker2015impala.html ├── kulkarni2015twitter.html ├── kung1981optimistic.html ├── lagar2009snowflock.html ├── lampson1980experience.html ├── lauer1979duality.html ├── lehman1981efficient.html ├── lehman1999t.html ├── letia2009crdts.html ├── li2012making.html ├── li2014automating.html ├── li2014scaling.html ├── lin2016towards.html ├── liu1973scheduling.html ├── lloyd2011don.html ├── madden2002tag.html ├── maddox2016decibel.html ├── mckeen2013innovative.html ├── mckusick1984fast.html ├── miller2000schema.html ├── mohan1986transaction.html ├── moore2006inferring.html ├── murray2013naiad.html ├── o1986escrow.html ├── o1997improved.html ├── ongaro2014search.html ├── prabhakaran2005analysis.html ├── quigley2009ros.html ├── recht2011hogwild.html ├── ritchie1978unix.html ├── roy2015homeostasis.html ├── saltzer1984end.html ├── selinger1979access.html ├── shapiro2011conflict.html ├── sigelman2010dapper.html ├── stoica2001chord.html ├── stonebraker1987design.html ├── stonebraker1991postgres.html ├── stonebraker2005c.html ├── taft2014store.html ├── terry1995managing.html ├── terry2013replicated.html ├── thomson2012calvin.html ├── toshniwal2014storm.html ├── tu2013speedy.html ├── van2004chain.html ├── vavilapalli2013apache.html ├── verma2015large.html ├── vogels2009eventually.html ├── waldspurger1994lottery.html ├── welsh2001seda.html ├── 
wilkes1996hp.html ├── yu2008dryadlinq.html ├── zaharia2012resilient.html ├── zaharia2013discretized.html ├── zhou2010efficient.html ├── zhou2011tap.html └── zhou2012distributed.html ├── images ├── dan2017using_fork.png └── dan2017using_zigzag.png ├── index.html ├── js ├── highlight.pack.js └── mathjax_config.js └── papers ├── adya2000generalized.md ├── agarwal1996computation.md ├── agrawal1987concurrency.md ├── akidau2013millwheel.md ├── alvaro2010boom.md ├── alvaro2011consistency.md ├── alvaro2011dedalus.md ├── alvaro2015lineage.md ├── arasu2006cql.md ├── armbrust2015spark.md ├── avnur2000eddies.md ├── bailis2013highly.md ├── bailis2014coordination.md ├── balegas2015putting.md ├── barham2003xen.md ├── baumann2015shielding.md ├── beckmann1990r.md ├── berenson1995critique.md ├── bershad1995spin.md ├── brewer2005combining.md ├── brewer2012cap.md ├── brin1998anatomy.md ├── bugnion1997disco.md ├── burns2016borg.md ├── burrows2006chubby.md ├── cafarella2008webtables.md ├── carriero1994linda.md ├── chamberlin1981history.md ├── chang2008bigtable.md ├── chen2016realtime.md ├── cheney2009provenance.md ├── clark2005live.md ├── codd1970relational.md ├── conway2012logic.md ├── conway2014edelweiss.md ├── corbett2013spanner.md ├── crooks2016tardis.md ├── dan2017using.md ├── dean2008mapreduce.md ├── decandia2007dynamo.md ├── dewitt1990gamma.md ├── dewitt1992parallel.md ├── diaconu2013hekaton.md ├── elphinstone2013l3.md ├── engler1995exokernel.md ├── ghemawat2003google.md ├── ghodsi2011dominant.md ├── gilbert2002brewer.md ├── goldman1997dataguides.md ├── golub1992microkernel.md ├── gonzalez2012powergraph.md ├── gotsman2016cause.md ├── graefe1990encapsulation.md ├── graefe1993volcano.md ├── graefe2009five.md ├── gray1976granularity.md ├── green2013datalog.md ├── halevy2016goods.md ├── harinarayan1996implementing.md ├── hellerstein2007architecture.md ├── hellerstein2010declarative.md ├── hellerstein2012madlib.md ├── herlihy1990linearizability.md ├── hindman2011mesos.md ├── holt2016disciplined.md ├── isard2007dryad.md ├── kandel2011wrangler.md ├── kapritsos2012all.md ├── klonatos2014building.md ├── kohler2000click.md ├── kohler2012declarative.md ├── kornacker2015impala.md ├── kulkarni2015twitter.md ├── kung1981optimistic.md ├── lagar2009snowflock.md ├── lampson1980experience.md ├── lauer1979duality.md ├── lehman1981efficient.md ├── lehman1999t.md ├── letia2009crdts.md ├── li2012making.md ├── li2014automating.md ├── li2014scaling.md ├── lin2016towards.md ├── liu1973scheduling.md ├── lloyd2011don.md ├── madden2002tag.md ├── maddox2016decibel.md ├── mckeen2013innovative.md ├── mckusick1984fast.md ├── miller2000schema.md ├── mohan1986transaction.md ├── moore2006inferring.md ├── murray2013naiad.md ├── o1986escrow.md ├── o1997improved.md ├── ongaro2014search.md ├── prabhakaran2005analysis.md ├── quigley2009ros.md ├── recht2011hogwild.md ├── ritchie1978unix.md ├── roy2015homeostasis.md ├── saltzer1984end.md ├── selinger1979access.md ├── shapiro2011conflict.md ├── sigelman2010dapper.md ├── stoica2001chord.md ├── stonebraker1987design.md ├── stonebraker1991postgres.md ├── stonebraker2005c.md ├── taft2014store.md ├── terry1995managing.md ├── terry2013replicated.md ├── thomson2012calvin.md ├── toshniwal2014storm.md ├── tu2013speedy.md ├── van2004chain.md ├── vavilapalli2013apache.md ├── verma2015large.md ├── vogels2009eventually.md ├── waldspurger1994lottery.md ├── welsh2001seda.md ├── wilkes1996hp.md ├── yu2008dryadlinq.md ├── zaharia2012resilient.md ├── zaharia2013discretized.md ├── zhou2010efficient.md ├── 
zhou2011tap.md └── zhou2012distributed.md /.gitignore: -------------------------------------------------------------------------------- 1 | # swap 2 | [._]*.s[a-w][a-z] 3 | [._]s[a-w][a-z] 4 | # session 5 | Session.vim 6 | # temporary 7 | .netrwhist 8 | *~ 9 | # auto-generated tag files 10 | tags 11 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | PAPERS = $(wildcard papers/*.md) 2 | HTMLS = $(subst papers/,html/,$(patsubst %.md,%.html,$(PAPERS))) 3 | 4 | default: $(HTMLS) 5 | 6 | html/%.html: papers/%.md header.html footer.html 7 | cat header.html > $@ 8 | pandoc --from markdown-tex_math_dollars-raw_tex --to html --ascii $< >> $@ 9 | cat footer.html >> $@ 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # [Papers](https://mwhittaker.github.io/papers) 2 | This repository is a compendium of notes on papers I've read. 3 | 4 | ## Getting Started 5 | If you want to read the paper summaries, simply go 6 | [here](https://mwhittaker.github.io/papers). If you want to add a paper 7 | summary, or clone this repo and create summaries of your own, here's how 8 | everything works. 9 | 10 | - First, add your summary as a markdown file in [the `papers` 11 | directory](papers/). 12 | - Run `make` to convert each markdown file `foo.md` into an HTML file 13 | `foo.html` using `pandoc`. 14 | - Then, update [`index.html`](index.html) with a link to the HTML file (e.g. 15 | `foo.html`) 16 | 17 | It's as simple as that. Oh, and if you want to include MathJax in your summary, 18 | add the following to the bottom of the markdown file: 19 | 20 | ```html 21 | 24 | ``` 25 | 26 | If you want to add syntax highlighting, add this: 27 | 28 | ```html 29 | 30 | 31 | 32 | ``` 33 | -------------------------------------------------------------------------------- /TODO.md: -------------------------------------------------------------------------------- 1 | - [x] `codd1970relational.md` 2 | - [x] `liu1973scheduling.md` 3 | - [x] `ritchie1978unix.md` 4 | - [x] `gray1976granularity.md` 5 | - [x] `lauer1979duality.md` 6 | - [x] `lampson1980experience.md` 7 | - [x] `chamberlin1981history.md` 8 | - [x] `mckusick1984fast.md` 9 | - [x] `saltzer1984end.md` 10 | - [x] `agrawal1987concurrency.md` 11 | - [x] `stonebraker1987design.md` 12 | - [x] `herlihy1990linearizability.md` 13 | - [x] `stonebraker1991postgres.md` 14 | - [x] `golub1992microkernel.md` 15 | - [ ] `carriero1994linda.md` 16 | - [ ] `waldspurger1994lottery.md` 17 | - [ ] `berenson1995critique.md` 18 | - [ ] `engler1995exokernel.md` 19 | - [ ] `bershad1995spin.md` 20 | - [ ] `wilkes1996hp.md` 21 | - [ ] `bugnion1997disco.md` 22 | - [ ] `lehman1999t.md` 23 | - [ ] `adya2000generalized.md` 24 | - [ ] `kohler2000click.md` 25 | - [ ] `stoica2001chord.md` 26 | - [ ] `moore2006inferring.md` 27 | - [ ] `welsh2001seda.md` 28 | - [ ] `gilbert2002brewer.md` 29 | - [ ] `ghemawat2003google.md` 30 | - [ ] `barham2003xen.md` 31 | - [ ] `van2004chain.md` 32 | - [ ] `dean2008mapreduce.md` 33 | - [ ] `prabhakaran2005analysis.md` 34 | - [ ] `clark2005live.md` 35 | - [ ] `burrows2006chubby.md` 36 | - [ ] `hellerstein2007architecture.md` 37 | - [ ] `isard2007dryad.md` 38 | - [ ] `decandia2007dynamo.md` 39 | - [ ] `chang2008bigtable.md` 40 | - [ ] `yu2008dryadlinq.md` 41 | - [ ] `letia2009crdts.md` 42 | - [ ] `graefe2009five.md` 43 | - 
[ ] `lagar2009snowflock.md` 44 | - [ ] `alvaro2010boom.md` 45 | - [ ] `hellerstein2010declarative.md` 46 | - [ ] `shapiro2011conflict.md` 47 | - [ ] `alvaro2011consistency.md` 48 | - [ ] `alvaro2011dedalus.md` 49 | - [ ] `ghodsi2011dominant.md` 50 | - [ ] `hindman2011mesos.md` 51 | - [ ] `lloyd2011don.md` 52 | - [ ] `conway2012logic.md` 53 | - [ ] `li2012making.md` 54 | - [ ] `zaharia2012resilient.md` 55 | - [ ] `vavilapalli2013apache.md` 56 | - [ ] `zaharia2013discretized.md` 57 | - [ ] `elphinstone2013l3.md` 58 | - [ ] `mckeen2013innovative.md` 59 | - [ ] `akidau2013millwheel.md` 60 | - [ ] `murray2013naiad.md` 61 | - [ ] `terry2013replicated.md` 62 | - [ ] `li2014automating.md` 63 | - [ ] `bailis2014coordination.md` 64 | - [ ] `conway2014edelweiss.md` 65 | - [ ] `bailis2013highly.md` 66 | - [ ] `ongaro2014search.md` 67 | - [ ] `baumann2015shielding.md` 68 | - [ ] `toshniwal2014storm.md` 69 | - [ ] `roy2015homeostasis.md` 70 | - [ ] `kornacker2015impala.md` 71 | - [ ] `verma2015large.md` 72 | - [ ] `balegas2015putting.md` 73 | - [ ] `armbrust2015spark.md` 74 | - [ ] `kulkarni2015twitter.md` 75 | - [ ] `burns2016borg.md` 76 | - [ ] `gotsman2016cause.md` 77 | - [ ] `maddox2016decibel.md` 78 | - [ ] `holt2016disciplined.md` 79 | - [ ] `halevy2016goods.md` 80 | - [ ] `chen2016realtime.md` 81 | - [ ] `crooks2016tardis.md` 82 | -------------------------------------------------------------------------------- /conferences/README.md: -------------------------------------------------------------------------------- 1 | - [ ] CIDR 2 | - [ ] SIGMOD 3 | - [ ] SOCC 4 | - [ ] VLDB 5 | - [ ] EuroSys 6 | - [ ] NSDI 7 | - [ ] OSDI 8 | - [ ] USENIX ATC 9 | - [ ] TODS 10 | - [ ] VLDBJ 11 | -------------------------------------------------------------------------------- /conferences/sigmod2016.md: -------------------------------------------------------------------------------- 1 | ### Session 12: Distributed Data Processing 2 | 1. **Realtime Data Processing at Facebook** Facebook describes five design 3 | decisions faced when building a stream processing system, what decisions 4 | existing stream processing system made, and how their stream processing 5 | systems (Puma, Swift, and Stylus) differ. 6 | 2. **SparkR: Scaling R Programs with Spark** R is convenient but slow. This 7 | paper presents an R package that provides a frontend to Apache Spark to 8 | allow large scale data analysis from within R. 9 | 3. **VectorH: taking SQL-on-Hadoop to the next level** VectorH is a 10 | SQL-on-Hadoop system built on the Vectorwise database. VectorH builds on 11 | HDFS and YARN and supports ordered tables uing Positional Delta Trees. It is 12 | orders of magnitude faster than existing SQL-on-Hadoop systems HAWQ, Impala, 13 | SparkSQL, and Hive. 14 | 4. **Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed 15 | In-memory Databases** Some memory databases have eschewed ARIES-like data 16 | logging for command logging to reduce the size of logs. However, this 17 | increases recovery time. This paper improves command logging allowing nodes 18 | to recover in parallel and also introduces an adaptive logging scheme that 19 | uses both data and command logging. 20 | 5. **Big Data Analytics with Datalog Queries on Spark** This paper presents 21 | BigDatalog: a system which runs Datalog efficiently using Apache Spark by 22 | exploiting compilation and optimization techniques. 23 | 6. 
**An Efficient MapReduce Cube Algorithm for Varied Data Distributions** This 24 | paper presents an algorithm that computes Data cubes using MapReduce that 25 | can tolerate skewed data distributions using a Skews and Partitions Sketch 26 | data structure. 27 | 28 | ### Session 18: Transactions and Consistency 29 | 1. **TARDiS: A Branch-and-Merge Approach To Weak Consistency** 30 | TARDiS is a transactional key-value store explicitly designed with weak 31 | consistency in mind. TARDiS exposes the set of conflicting branches in an 32 | eventually consistent system and allows clients to merge when desired. 33 | 2. **TicToc: Time Traveling Optimistic Concurrency Control** TicToc is a new 34 | timestamp management protocol that assigns read and write timestamps to data 35 | items and lazily computes commit timestamps. 36 | 3. **Scaling Multicore Databases via Constrained Parallel Execution** 2PL and 37 | OCC sacrifice parallelism in high contention workloads. This paper presents 38 | a new concurrency control scheme: interleaving constrained concurrency 39 | control (IC3). IC3 uses static analysis and runtime techniques. 40 | 4. **Towards a Non-2PC Transaction Management in Distributed Database Systems** 41 | Shared nothing transactional databases have fast local transactions but slow 42 | distributed transactions due to 2PC. This paper introduces the LEAP 43 | transaction management scheme that converts distributed transactions into 44 | local ones. Their database L-store is compared against H-store. 45 | 5. **ERMIA: Fast memory-optimized database system for heterogeneous workloads** 46 | ERMIA is a database that optimizes heterogeneous read-mostly workloads using 47 | snapshot isolation that performs better than traditional OCC systems. 48 | 6. **Transaction Healing: Scaling Optimistic Concurrency Control on Multicores** 49 | Multicore OCC databases don't scale well with high contention workloads. 50 | This paper presents a new concurrency control mechanism known as transaction 51 | healing where dependencies are analyzed ahead of time and transactions are 52 | not full-blown aborted when validation fails. Transaction healing is 53 | implemented in TheDB. 
54 | -------------------------------------------------------------------------------- /css/default_highlight.css: -------------------------------------------------------------------------------- 1 | /* 2 | 3 | Original highlight.js style (c) Ivan Sagalaev 4 | 5 | */ 6 | 7 | .hljs { 8 | /* display: block; */ 9 | /* overflow-x: auto; */ 10 | /* padding: 0.5em; */ 11 | /* background: #F0F0F0; */ 12 | } 13 | 14 | 15 | /* Base color: saturation 0; */ 16 | 17 | .hljs, 18 | .hljs-subst { 19 | color: #444; 20 | } 21 | 22 | .hljs-comment { 23 | color: #888888; 24 | } 25 | 26 | .hljs-keyword, 27 | .hljs-attribute, 28 | .hljs-selector-tag, 29 | .hljs-meta-keyword, 30 | .hljs-doctag, 31 | .hljs-name { 32 | font-weight: bold; 33 | } 34 | 35 | 36 | /* User color: hue: 0 */ 37 | 38 | .hljs-type, 39 | .hljs-string, 40 | .hljs-number, 41 | .hljs-selector-id, 42 | .hljs-selector-class, 43 | .hljs-quote, 44 | .hljs-template-tag, 45 | .hljs-deletion { 46 | color: #880000; 47 | } 48 | 49 | .hljs-title, 50 | .hljs-section { 51 | color: #880000; 52 | font-weight: bold; 53 | } 54 | 55 | .hljs-regexp, 56 | .hljs-symbol, 57 | .hljs-variable, 58 | .hljs-template-variable, 59 | .hljs-link, 60 | .hljs-selector-attr, 61 | .hljs-selector-pseudo { 62 | color: #BC6060; 63 | } 64 | 65 | 66 | /* Language color: hue: 90; */ 67 | 68 | .hljs-literal { 69 | color: #78A960; 70 | } 71 | 72 | .hljs-built_in, 73 | .hljs-bullet, 74 | .hljs-code, 75 | .hljs-addition { 76 | color: #397300; 77 | } 78 | 79 | 80 | /* Meta color: hue: 200 */ 81 | 82 | .hljs-meta { 83 | color: #1f7199; 84 | } 85 | 86 | .hljs-meta-string { 87 | color: #4d99bf; 88 | } 89 | 90 | 91 | /* Misc effects */ 92 | 93 | .hljs-emphasis { 94 | font-style: italic; 95 | } 96 | 97 | .hljs-strong { 98 | font-weight: bold; 99 | } 100 | -------------------------------------------------------------------------------- /css/style.css: -------------------------------------------------------------------------------- 1 | html { 2 | font-family: "Helvetica Neue",Helvetica,Arial,sans-serif; 3 | background: #FEFEFE; 4 | } 5 | 6 | body { 7 | color: #333; 8 | margin: 0; 9 | padding: 0; 10 | font-size: 18px; 11 | } 12 | 13 | #header { 14 | background-color: #f44336; 15 | color: white; 16 | font-size: 24pt; 17 | font-weight: bold; 18 | padding-bottom: 10px; 19 | padding-top: 10px; 20 | text-align: center; 21 | } 22 | 23 | #header a { 24 | color: white; 25 | text-decoration: none; 26 | } 27 | 28 | #container { 29 | padding-left: 1em; 30 | padding-right: 1em; 31 | max-width: 44em; 32 | line-height: 27px; 33 | padding-top: 20px; 34 | padding-bottom: 20px; 35 | margin-left: auto; 36 | margin-right: auto; 37 | } 38 | 39 | @media screen and (min-width: 600px) { 40 | #container { 41 | padding-left: 28px; 42 | padding-right: 28px; 43 | } 44 | } 45 | 46 | #indextitle { 47 | font-size: 60pt; 48 | margin-top: 0pt; 49 | margin-bottom: 0pt; 50 | } 51 | 52 | .year { 53 | color: gray; 54 | } 55 | 56 | a { text-decoration: none; } 57 | h1 a:link, h1 a:visited { color: #222; } 58 | h1 a:active, h1 a:hover { color: #444; } 59 | #paperlist a:link, #paperlist a:visited { color: #222; } 60 | #paperlist a:active, #paperlist a:hover { color: #444; } 61 | 62 | h1 { 63 | line-height: 1; 64 | margin-top: 0em; 65 | margin-bottom: 0.55em; 66 | } 67 | 68 | h2 { 69 | margin-top: 0.9em; 70 | margin-bottom: 0.45em; 71 | } 72 | 73 | h3 { 74 | margin-top: 0.8em; 75 | margin-bottom: 0.35em; 76 | } 77 | 78 | h4 { 79 | margin-top: 0.7em; 80 | margin-bottom: 0.25em; 81 | } 82 | 83 | p { 84 | margin-top: 0em; 85 | 
margin-bottom: 1.15em; 86 | } 87 | 88 | pre { 89 | margin-top: 0em; 90 | margin-bottom: 1.15em; 91 | background-color: #f7f7f7; 92 | overflow: auto; 93 | padding: 5pt; 94 | } 95 | 96 | @media screen and (min-width: 600px) { 97 | pre { 98 | padding: 16pt; 99 | } 100 | } 101 | 102 | code { 103 | font-family: Consolas, "Liberation Mono", Menlo, Courier, monospace; 104 | } 105 | 106 | table { 107 | margin-bottom: 1.15em; 108 | border-collapse: collapse; 109 | } 110 | 111 | td, th { 112 | padding: 2pt; 113 | border: 1.25pt solid gray; 114 | } 115 | 116 | blockquote { 117 | padding-left: 1em; 118 | color: #777; 119 | border-left: 0.25em solid #ddd; 120 | } 121 | 122 | .math { 123 | overflow: auto; 124 | } 125 | -------------------------------------------------------------------------------- /footer.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 13 | 14 | 15 | -------------------------------------------------------------------------------- /header.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 | -------------------------------------------------------------------------------- /html/adya2000generalized.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Generalized Isolation Level Definitions (2000)

15 |

Summary. In addition to serializability, ANSI SQL-92 defined a set of weaker isolation levels that applications could use to improve performance at the cost of consistency. The definitions were implementation-independent but ambiguous. Berenson et al. proposed a revision of the isolation level definitions that was unambiguous but specific to locking. Specifically, they define a set of phenomena:

16 |
    17 |
  • P0: w1(x) ... w2(x) ... "dirty write"
  • 18 |
  • P1: w1(x) ... r2(x) ... "dirty read"
  • 19 |
  • P2: r1(x) ... w2(x) ... "unrepeatable read"
  • 20 |
  • P3: r1(P) ... w2(y in P) ... "phantom read"
  • 21 |
22 |

and define the isolation levels according to which phenomena they preclude. This preclusion can be implemented by varying how long certain types of locks are held:

| write locks | read locks | phantom locks | precluded      |
| ----------- | ---------- | ------------- | -------------- |
| short       | short      | short         | P0             |
| long        | short      | short         | P0, P1         |
| long        | long       | short         | P0, P1, P2     |
| long        | long       | long          | P0, P1, P2, P3 |
59 |

This locking-specific preventative approach to defining isolation levels, while unambiguous, rules out many non-locking implementations of concurrency control. Notably, it does not allow for multiversioning and does not allow non-committed transactions to experience weaker consistency than committed transactions. Moreover, many isolation levels are naturally expressed as invariants between multiple objects, but these definitions are all over a single object.

60 |

This paper introduces implementation-independent, unambiguous isolation level definitions. The definitions also include notions of predicates at all levels. It does so by first introducing the definition of a history as a partial order of read/write/commit/abort events and a total order of committed object versions. It then introduces three dependencies: read-dependencies, anti-dependencies, and write-dependencies (also known as write-read, read-write, and write-write dependencies). Next, it describes how to construct a dependency graph and defines isolation levels as constraints on these graphs.
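As a rough illustration (mine, not the paper's), here is a minimal Python sketch of the dependency-graph idea, assuming a history is given as a flat list of (transaction, operation, object) events in commit order: it collects write-write edges and checks whether they form a cycle, i.e. phenomenon G0, which PL-1 precludes.

```python
from collections import defaultdict

def write_dependency_edges(history):
    """history: list of (txn, op, obj) events in commit order, op in {'r', 'w'}.
    Edge (t1, t2) means t2 installed the version of obj immediately after t1."""
    last_writer = {}
    edges = set()
    for txn, op, obj in history:
        if op == 'w':
            if obj in last_writer and last_writer[obj] != txn:
                edges.add((last_writer[obj], txn))
            last_writer[obj] = txn
    return edges

def has_cycle(edges):
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
    visiting, done = set(), set()
    def dfs(node):
        visiting.add(node)
        for nxt in graph[node]:
            if nxt in visiting:
                return True
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False
    return any(dfs(n) for n in list(graph) if n not in done)

# w1(x) w2(x) w2(y) w1(y): T1 and T2 overwrite each other -> G0, so not PL-1.
history = [('T1', 'w', 'x'), ('T2', 'w', 'x'), ('T2', 'w', 'y'), ('T1', 'w', 'y')]
print(has_cycle(write_dependency_edges(history)))  # True
```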

61 |

For example, the G0 phenomenon says that a dependency graph contains a write-dependency cycle. PL-1 is the isolation level that precludes G0. Similarly, the G1 phenomenon says that either

62 |
    63 |
  1. a committed transaction reads an aborted value,
  2. 64 |
  3. a committed transaction reads an intermediate value, or
  4. 65 |
  5. there is a write-read/write-write cycle.
  6. 66 |
67 |

The PL-2 isolation level precludes G1 (and therefore G0) and corresponds roughly to the READ-COMMITTED isolation level.

68 |
69 | 70 | 71 | 80 | 81 | 82 | -------------------------------------------------------------------------------- /html/alvaro2010boom.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud (2010)

15 |

Summary. Programming distributed systems is hard. Really hard. This paper conjectures that data-centric programming done with declarative programming languages can lead to simpler distributed systems that are more correct with less code. To support this conjecture, the authors implement an HDFS and Hadoop clone in Overlog, dubbed BOOM-FS and BOOM-MR respectively, using orders of magnitude fewer lines of code than the original implementations. They also extend BOOM-FS with increased availability, scalability, and monitoring.

16 |

An HDFS cluster consists of a NameNode responsible for metadata management and DataNodes responsible for data management. BOOM-FS reimplements the metadata protocol in Overlog; the data protocol is implemented in Java. The implementation models the entire system state (e.g. files, file paths, heartbeats) as data in a unified way by storing it as collections. The Overlog implementation of the NameNode then operates on and updates these collections. Some of the data (e.g. file paths) are actually views that can be optionally materialized and incrementally maintained. After reaching (almost) feature parity with HDFS, the authors increased the availability of the NameNode by using Paxos to introduce hot standby replicas. They also partition the NameNode metadata to increase scalability and use metaprogramming to implement monitoring.

17 |

BOOM-MR plugs into the existing Hadoop code but reimplements two MapReduce scheduling algorithms: Hadoop's first-come first-served algorithm and Zaharia's LATE policy.

18 |

BOOM Analytics was implemented in orders of magnitude fewer lines of code thanks to the data-centric approach and the declarative programming language. The implementation is also almost as fast as the systems it reimplements.

19 |
20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /html/alvaro2011consistency.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Consistency Analysis in Bloom: a CALM and Collected Approach (2011)

15 |

Summary. Strong consistency eases reasoning about distributed systems, but it requires coordination which entails higher latency and unavailability. Adopting weaker consistency models can improve system performance but writing applications against these weaker guarantees can be nuanced, and programmers don't have any reasoning tools at their disposal. This paper introduces the CALM conjecture and Bloom, a disorderly declarative programming language based on CALM, which allows users to write loosely consistent systems in a more principled way.

16 |

Say we've eschewed strong consistency: how do we know our program is even eventually consistent? For example, consider a distributed register in which servers accept writes on a first-come, first-served basis. Two clients could concurrently write two distinct values x and y. One server may receive x then y; the other y then x. This system is not eventually consistent. Even after client requests have quiesced, the distributed register is in an inconsistent state. The CALM conjecture embraces Consistency As Logical Monotonicity and says that logically monotonic programs are eventually consistent for any ordering and interleaving of message delivery and computation. Moreover, they do not require waiting or coordination to stream results to clients.
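A toy illustration of the register example (my own sketch, not Bloom code): a first-writer-wins register's final state depends on delivery order, while a logically monotonic grow-only set converges to the same state under any order.

```python
# Toy models (mine, not Bloom code) of order sensitivity.
class FirstWriteRegister:
    """Keeps the first write it receives: the outcome depends on delivery order."""
    def __init__(self):
        self.value = None
    def write(self, v):
        if self.value is None:
            self.value = v

class GrowOnlySet:
    """Logically monotonic: the final state is the same for any delivery order."""
    def __init__(self):
        self.items = set()
    def write(self, v):
        self.items.add(v)

for model in (FirstWriteRegister, GrowOnlySet):
    server1, server2 = model(), model()
    for v in ('x', 'y'):
        server1.write(v)   # server 1 receives x then y
    for v in ('y', 'x'):
        server2.write(v)   # server 2 receives y then x
    print(model.__name__, vars(server1) == vars(server2))
# FirstWriteRegister False, GrowOnlySet True: only the monotonic program converges.
```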

17 |

Bud (Bloom under development) is the realization of the CALM conjecture. It is Bloom implemented as a Ruby DSL. Users declare a set of Bloom collections (e.g. persistent tables, temporary tables, channels) and an order-independent set of declarative statements similar to Datalog. Viewing Bloom through the lens of its procedural semantics, Bloom execution proceeds in timesteps, and each timestep is divided into three phases. First, external messages are delivered to a node. Second, the Bloom statements are evaluated. Third, messages are sent off elsewhere. Bloom also supports modules and interfaces to improve modularity.

18 |

The paper also implements a key-value store and shopping cart using Bloom and uses various visualization tools to guide the design of coordination-free implementations.

19 |
20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /html/armbrust2015spark.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Spark SQL: Relational Data Processing in Spark (2015)

15 |

Summary. Data processing frameworks like MapReduce and Spark can do things that relational databases can't do very easily. For example, they can operate over semi-structured or unstructured data, and they can perform advanced analytics. On the other hand, Spark's API allows user to run arbitrary code (e.g. rdd.map(some_arbitrary_function)) which prevents Spark from performing certain optimizations. Spark SQL marries imperative Spark-like data processing with declarative SQL-like data processing into a single unified interface.

16 |

Spark's main abstraction was an RDD. Spark SQL's main abstraction is a DataFrame: the Spark analog of a table which supports a nested data model of standard SQL types as well as structs, arrays, maps, unions, and user defined types. DataFrames can be manipulated as if they were RDDs of row objects (e.g. dataframe.map(row_func)), but they also support a set of standard relational operators which take ASTs, built using a DSL, as arguments. For example, users.where(users("age") < 40) builds an AST from the expression users("age") < 40 and passes it as the argument that filters the users DataFrame. By passing in ASTs as arguments rather than arbitrary user code, Spark is able to perform optimizations it previously could not do. DataFrames can also be queried using SQL.
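As a concrete example in PySpark (a sketch assuming a local SparkSession and a toy users table), the predicate below is captured as a Column expression tree that the optimizer can inspect, in contrast to an opaque Python lambda applied to the underlying RDD:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

# Toy data standing in for a real users table.
users = spark.createDataFrame([("alice", 34), ("bob", 52)], ["name", "age"])

# users["age"] < 40 builds a Column expression (an AST), not a Python closure,
# so the optimizer can inspect it and, e.g., push the filter down.
young = users.where(users["age"] < 40).select("name")
young.show()

# Contrast: an RDD-style filter with an arbitrary Python function is opaque
# to the optimizer.
young_opaque = users.rdd.filter(lambda row: row.age < 40)
print(young_opaque.count())
```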

17 |

Notably, integrating queries into an existing programming language (e.g. Scala) makes writing queries much easier. Intermediate subqueries can be reused, queries can be constructed using standard control flow, etc. Moreover, Spark eagerly typechecks queries even though their execution is lazy. Furthermore, Spark SQL allows users to create DataFrames of language objects (e.g. Scala objects), and UDFs are just normal Scala functions.

18 |

DataFrame queries are optimized and manipulated by a new extensible query optimizer called Catalyst. The query optimizer manipulates ASTs written in Scala using rules, which are just functions from trees to trees that typically use pattern matching. Queries are optimized in four phases:

19 |
    20 |
  1. Analysis. First, relations and columns are resolved, queries are typechecked, etc.
  2. 21 |
  3. Logical optimization. Typical logical optimizations like constant folding, filter pushdown, boolean expression simplification, etc are performed.
  4. 22 |
  5. Physical planning. Cost based optimization is performed.
  6. 23 |
  7. Code generation. Scala quasiquoting is used for code generation.
  8. 24 |
25 |

Catalyst also makes it easy for people to add new data sources and user defined types.

26 |

Spark SQL also supports schema inference, ML integration, and query federation: useful features for big data.

27 |
28 | 29 | 30 | 39 | 40 | 41 | -------------------------------------------------------------------------------- /html/bailis2013highly.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Highly Available Transactions: Virtues and Limitations (2014)

15 |

Summary. Serializability is the gold standard of consistency, but databases have always provided weaker consistency modes (e.g. Read Committed, Repeatable Read) that promise improved performance. In this paper, Bailis et al. determine which of these weaker consistency models can be implemented with high availability.

16 |

First, why is high availability important?

17 |
    18 |
  1. Partitions. Partitions happen, and when they do non-available systems become, well, unavailable.
  2. 19 |
  3. Latency. Partitions may be transient, but latency is forever. Highly available systems can avoid latency by eschewing coordination costs.
  4. 20 |
21 |

Second, are weaker consistency models consistent enough? In short, yeah probably. In a survey of databases, Bailis finds that many do not employ serializability by default and some do not even provide full serializability. Bailis also finds that four of the five transactions in the TPC-C benchmark can be implemented with highly available transactions.

22 |

After defining availability, Bailis presents a taxonomy of which consistency models can be implemented as HATs, and also argues why some fundamentally cannot. He also performs benchmarks on AWS to show the performance benefits of HATs.

23 |
24 | 25 | 26 | 35 | 36 | 37 | -------------------------------------------------------------------------------- /html/bailis2014coordination.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Coordination Avoidance in Database Systems (2014)

15 |

Overview. Coordination in a distributed system is sometimes necessary to maintain application correctness, or consistency. For example, a payroll application may require that each employee has a unique ID, or that a jobs relation only include valid employees. However, coordination is not cheap. It increases latency, and in the face of partitions can lead to unavailability. Thus, when application correctness permits, coordination should be avoided. This paper develops the necessary and sufficient conditions for when coordination is needed to maintain a set of database invariants, using a notion of invariant confluence or I-confluence.

16 |

System Model. A database state is a set D of object versions drawn from the universe of possible states. Transactions operate on logical replicas that contain the set of object versions relevant to the transaction. Transactions are modeled as functions T : D -> D. The effects of a transaction are merged into an existing replica using an associative, commutative, idempotent merge operator, and replicas share changes with one another and merge them in the same way. In this paper, merge is set union, and we assume we know all transactions in advance. Invariants are modeled as boolean functions I: D -> 2. A state R is said to be I-valid if I(R) is true.

17 |

We say a system has transactional availability if whenever a transaction T can contact servers with the appropriate data in T, it only aborts if T chooses to abort. We say a system is convergent if after updates quiesce, all servers eventually have the same state. A system is globally I-valid if all replicas always have I-valid states. A system provides coordination-free execution if execution of a given transaction does not depend on the execution of others.

18 |

Consistency Sans Coordination. A state Si is I-T-Reachable if it is derivable from I-valid states using transactions in T. A set of transactions T is I-confluent with respect to invariant I if, for all I-T-Reachable states Di, Dj with a common ancestor, the merge of Di and Dj is I-valid. A globally I-valid system can execute a set of transactions T with global validity, transactional availability, convergence, and coordination-freedom if and only if T is I-confluent with respect to I.

19 |

Applying Invariant-Confluence. I-confluence can be applied to existing relational operators and constraints. For example, updates, inserts, and deletes are I-confluent with respect to per-record inequality constraints. Deletions are I-confluent with respect to foreign key constraints; additions and updates are not. I-confluence can also be applied to abstract data types like counters.
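A toy sketch (mine, not the paper's formalism) of what an I-confluence check looks like for one pair of divergent states, taking merge to be set union: a uniqueness invariant fails the check, so maintaining it requires coordination.

```python
def merge_preserves_invariant(invariant, d_i, d_j):
    """I-confluence demands that merging (here, set union) two I-valid divergent
    states yields an I-valid state; this checks a single candidate pair."""
    assert invariant(d_i) and invariant(d_j)
    return invariant(d_i | d_j)

# Invariant: employee IDs are unique.
unique_ids = lambda db: len({emp_id for (emp_id, name) in db}) == len(db)

base = frozenset({(1, 'ada')})
replica_a = base | {(2, 'bob')}    # replica A assigns ID 2 to bob
replica_b = base | {(2, 'carol')}  # replica B concurrently assigns ID 2 to carol

# Both replicas are I-valid on their own, but the merged state is not:
print(merge_preserves_invariant(unique_ids, replica_a, replica_b))  # False
# So inserts under a uniqueness invariant are not I-confluent and need coordination.
```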

20 |
21 | 22 | 23 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /html/baumann2015shielding.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Shielding Applications from an Untrusted Cloud with Haven (2014)

15 |

Summary. When running an application in the cloud, users have to trust (i) the cloud provider's software, (ii) the cloud provider's staff, and (iii) law enforcement with the ability to access user data. Intel SGX partially solves this problem by allowing users to run small portions of a program on remote servers with guarantees of confidentiality and integrity. Haven leverages SGX and Drawbridge to run entire legacy programs with shielded execution.

16 |

Haven assumes a very strong adversary which has access to all of the system's software and most of the system's hardware. Only the processor and SGX hardware are trusted. Haven provides confidentiality and integrity, but not availability. It also does not prevent side-channel attacks.

17 |

There are two main challenges that Haven's design addresses. First, most programs are written assuming a benevolent host. This leads to Iago attacks in which the OS subverts the application by exploiting its assumptions about the OS. Haven must operate correctly despite a malicious host. To do so, Haven uses a library operating system (LibOS) that is part of a Windows sandboxing framework called Drawbridge. LibOS implements a full OS API using only a few core host OS primitives. These core host OS primitives are used in a defensive way. A shield module sits below LibOS and takes great care to ensure that LibOS is not susceptible to Iago attacks. The user's application, LibOS, and the shield module are all run in an SGX enclave.

18 |

Second, Haven aims to run unmodified binaries which were not written with knowledge of SGX. Real world applications allocate memory, load and run code dynamically, etc. Many of these things are not supported by SGX, so Haven (a) emulated them and (b) got the SGX specification revised to address them.

19 |

Haven also implements an in-enclave encrypted file system in which only the root and leaf pages need to be written to stable storage. As of publication, however, Haven did not fully implement this feature. Haven is susceptible to replay attacks.

20 |

Haven was evaluated by running Microsoft SQL Server and Apache HTTP Server.

21 |
22 | 23 | 24 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /html/bershad1995spin.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

SPIN -- An Extensible Microkernel for Application-specific Operating System Services (1995)

15 |

Summary. Many operating systems were built a long time ago, and their performance was tailored to the applications and workloads of the time. More recent applications, like databases and multimedia applications, are quite different from those applications and can perform quite poorly on existing operating systems. SPIN is an extensible microkernel that allows applications to tailor the operating system to meet their needs.

16 |

Existing operating systems fit into one of three categories:

17 |
    18 |
  1. They have no interface by which applications can modify kernel behavior.
  2. 19 |
  3. They have a clean interface applications can use to modify kernel behavior but the implementation of the interface is inefficient.
  4. 20 |
  5. They have an unconstrained interface that is efficiently implemented but does not provide isolation between applications.
  6. 21 |
22 |

SPIN provides applications with a way to efficiently and safely modify the behavior of the kernel. Programs in SPIN are divided into user-level code and a spindle: a portion of user code that is dynamically installed and run in the kernel. The kernel provides a set of abstractions for physical and logical resources, and the spindles are responsible for managing these resources. The spindles can also register to be invoked when certain kernel events (e.g. page faults) occur. Installing spindles directly into the kernel provides efficiency. Applications can execute code in the kernel without the need for a context switch.

23 |

To ensure safety, spindles are written in a typed object-oriented language. Each spindle is like an object; it contains local state and a set of methods. Some of these methods can be called by the application, and some are registered as callbacks in the kernel. A spindle checker uses a combination of static analysis and runtime checks to ensure that the spindles meet certain kernel invariants. Moreover, SPIN relies on advanced compiler technology to ensure efficient spindle compilation.
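Purely as an illustration of the shape of this interface (SPIN's spindles were written in a typed, compiled language, not Python, and the names below are hypothetical), the sketch shows a spindle bundling local state with an event handler that a kernel installs only after a checker accepts it:

```python
class CountingFaultSpindle:
    """Hypothetical spindle: local state plus a handler the kernel can invoke."""
    def __init__(self):
        self.faults = 0
    def on_page_fault(self, addr):
        self.faults += 1
        return "map_page"          # tell the kernel how to resolve the fault

class ToyKernel:
    def __init__(self):
        self.handlers = {}
    def install(self, spindle, event, checker):
        # The checker stands in for SPIN's spindle checker (static analysis
        # plus runtime checks); here it is just a predicate.
        if not checker(spindle):
            raise ValueError("spindle rejected: violates kernel invariants")
        self.handlers[event] = getattr(spindle, "on_" + event)
    def raise_event(self, event, *args):
        return self.handlers[event](*args)

kernel = ToyKernel()
kernel.install(CountingFaultSpindle(), "page_fault",
               checker=lambda s: callable(getattr(s, "on_page_fault", None)))
print(kernel.raise_event("page_fault", 0x1000))  # "map_page", no context switch modeled
```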

24 |

General purpose high-performance computing, parallel processing, multimedia applications, databases, and information retrieval systems can benefit from the application-specific services provided by SPIN. Using techniques such as

25 |
    26 |
  • extensible IPC;
  • 27 |
  • application-level protocol processing;
  • 28 |
  • fast, simple, communication;
  • 29 |
  • application-specific file systems and buffer cache management;
  • 30 |
  • user-level scheduling;
  • 31 |
  • optimistic transaction;
  • 32 |
  • real-time scheduling policies;
  • 33 |
  • application-specific virtual memory; and
  • 34 |
  • runtime systems with memory system feedback,
  • 35 |
36 |

applications can be implemented more efficiently on SPIN than on traditional operating systems.

37 |
38 | 39 | 40 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /html/bugnion1997disco.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Disco: Running Commodity Operating Systems on Scalable Multiprocessors (1997)

15 |

Summary. Operating systems are complex, million line code bases. Multiprocessors were becoming popular, but it was too difficult to modify existing commercial operating systems to take full advantage of the new hardware. Disco is a virtual machine monitor, or hypervisor, that uses virtualization to run commercial virtual machines on cache-coherent NUMA multiprocessors. Guest operating systems running on Disco are only slightly modified, yet are still able to take advantage of the multiprocessor. Moreover, Disco offers all the traditional benefits of a hypervisor (e.g. fault isolation).

16 |

Disco provides the following interfaces:

17 |
    18 |
  • Processors. Disco provides full virtualization of the CPU allowing for restricted direct execution. Some privileged registers are mapped to memory to allow guest operating systems to read them.
  • 19 |
  • Physical memory. Disco virtualizes the guest operating system's physical address spaces, mapping them to hardware addresses. It also supports page migration and replication to alleviate the non-uniformity of a NUMA machine.
  • 20 |
  • I/O devices. All I/O communication is simulated, and various virtual disks are used.
  • 21 |
22 |

Disco is implemented as follows:

23 |
    24 |
  • Virtual CPUs. Disco maintains the equivalent of a process table entry for each guest operating system. Disco runs in kernel mode, guest operating systems run in supervisor mode, and applications run in user mode.
  • 25 |
  • Virtual physical memory. To avoid the overhead of physical to hardware address translation, Disco maintains a large software physical to hardware TLB.
  • 26 |
  • NUMA. Disco migrates pages to the CPUs that access them frequently, and replicates read-only pages to the CPUs that read them frequently. This dynamic page migration and replication helps mask the non-uniformity of a NUMA machine.
  • 27 |
  • Copy-on-write disks. Disco can map physical addresses in different guest operating systems to a read-only page in hardware memory. This lowers the memory footprint of running multiple guest operating systems. The shared pages are copy-on-write.
  • 28 |
  • Virtual network interfaces. Disco runs a virtual subnet over which guests can communicate using standard communication protocols like NFS and TCP. Disco uses a similar copy-on-write trick as above to avoid copying data between guests.
  • 29 |
30 |
31 | 32 | 33 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /html/burns2016borg.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Borg, Omega, and Kubernetes (2016)

15 |

Summary. Google has spent the last decade developing three container management systems. Borg is Google's main cluster management system that manages long-running production services and non-production batch jobs on the same set of machines to maximize cluster utilization. Omega is a clean-slate rewrite of Borg with a more principled architecture. In Omega, all system state lives in a consistent Paxos-based storage system that is accessed by a multitude of components which act as peers. Kubernetes is the latest open-source container manager that draws on lessons from both previous systems.

16 |

All three systems use containers for security and performance isolation. Container technology has evolved greatly since the inception of Borg, from chroot to jails to cgroups. Of course, containers cannot prevent all forms of performance interference. Today, containers also package program images.

17 |

Containers allow the cloud to shift from a machine-oriented design to an application-oriented design, a shift that brings a number of advantages.

18 |
    19 |
  • The gap between where an application is developed and where it is deployed is shrunk.
  • 20 |
  • Application writers don't have to worry about the details of the operating system on which the application will run.
  • 21 |
  • Infrastructure operators can upgrade hardware without worrying about breaking a lot of applications.
  • 22 |
  • Telemetry is tied to applications rather than machines which improves introspection and debugging.
  • 23 |
24 |

Container management systems typically also provide a host of other niceties including:

25 |
    26 |
  • naming services,
  • 27 |
  • autoscaling,
  • 28 |
  • load balancing,
  • 29 |
  • rollout tools, and
  • 30 |
  • monitoring tools.
  • 31 |
32 |

In Borg, these features were integrated over time in ad hoc ways. Kubernetes organizes these features under a unified and flexible API.

33 |

Google's experience has yielded a number of things to avoid:

34 |
    35 |
  • Container management systems shouldn't manage ports. Kubernetes gives each job a unique IP address allowing it to use any port it wants.
  • 36 |
  • Containers should have labels, not just numbers. Borg gave each task an index within its job. Kubernetes allows jobs to be labeled with key-value pairs and be grouped based on these labels.
  • 37 |
  • In Borg, every task belongs to a single job. Kubernetes makes task management more flexible by allowing a task to belong to multiple groups.
  • 38 |
  • Omega exposed the raw state of the system to its components. Kubernetes avoids this by arbitrating access to state through an API.
  • 39 |
40 |

Despite the decade of experience, there are still open problems yet to be solved:

41 |
    42 |
  • Configuration. Configuration languages begin simple but slowly evolve into complicated and poorly designed Turing complete programming languages. It's ideal to have configuration files be simple data files and let real programming languages manipulate them.
  • 43 |
  • Dependency management. Programs have lots of dependencies but don't manually state them. This makes automated dependency management very tough.
  • 44 |
45 |
46 | 47 | 48 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /html/codd1970relational.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

A Relational Model of Data for Large Shared Data Banks (1970)

15 |

Summary

16 |

In this paper, Ed Codd introduces the relational data model. Codd begins by motivating the importance of data independence: the independence of the way data is queried and the way data is stored. He argues that existing database systems at the time lacked data independence; namely, the ordering of relations, the indexes on relations, and the way the data was accessed was all made explicit when the data was queried. This made it impossible for the database to evolve the way data was stored without breaking existing programs which queried the data. The relational model, on the other hand, allowed for a much greater degree of data independence. After Codd introduces the relational model, he provides an algorithm to convert a relation (which may contain other relations) into first normal form (i.e. relations cannot contain other relations). He then describes basic relational operators, data redundancy, and methods to check for database consistency.
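As a small illustrative sketch (my own, not Codd's notation), the idea behind normalization into first normal form is to eliminate relation-valued attributes by splitting out the nested relation and propagating the parent's key:

```python
# A relation whose tuples contain a nested relation (children), keyed by emp_id.
employees = [
    {"emp_id": 1, "name": "ada", "children": [{"child": "carol"}, {"child": "dan"}]},
    {"emp_id": 2, "name": "bob", "children": []},
]

# First normal form: no relation-valued attributes. Split out the nested
# relation and propagate the parent's key into it.
employees_1nf = [{"emp_id": e["emp_id"], "name": e["name"]} for e in employees]
children_1nf = [{"emp_id": e["emp_id"], **c} for e in employees for c in e["children"]]

print(employees_1nf)  # [{'emp_id': 1, 'name': 'ada'}, {'emp_id': 2, 'name': 'bob'}]
print(children_1nf)   # [{'emp_id': 1, 'child': 'carol'}, {'emp_id': 1, 'child': 'dan'}]
```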

17 |

Commentary

18 |
    19 |
  1. Codd's advocacy for data independence and a declarative query language have stood the test of time. I particularly enjoy one excerpt from the paper where Codd says, "The universality of the data sublanguage lies in its descriptive ability (not its computing ability)".
  2. 20 |
  3. Database systems at the time generally had two types of data: collections and links between those collections. The relational model represented both as relations. Today, this seems rather mundane, but I can imagine this being counterintuitive at the time. This is also yet another example of a unifying interface which is demonstrated in both the Unix and System R papers.
  4. 21 |
22 |
23 | 24 | 25 | 34 | 35 | 36 | -------------------------------------------------------------------------------- /html/conway2012logic.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Logic and Lattices for Distributed Programming (2012)

15 |

Summary. CRDTs provide eventual consistency without the need for coordination. However, they suffer a scope problem: simple CRDTs are easy to reason about and use, but more complicated CRDTs force programmers to ensure they satisfy semilattice properties. They also lack composability. Consider, for example, a Students set and a Teams set. (Alice, Bob) can be added to Teams while concurrently Bob is removed from Students. Each individual set may be a CRDT, but there is no mechanism to enforce consistency between the CRDTs.

16 |

Bloom and CALM, on the other hand, allow for mechanized program analysis to guarantee that a program can avoid coordination. However, Bloom suffers from a type problem: it only operates on sets, which precludes the use of other useful structures such as integers.

17 |

This paper merges CRDTs and Bloom together by introducing bounded join semilattices into Bloom to form a new language: Bloom^L. Bloom^L operates over semilattices, applying semilattice methods. These methods can be designated as non-monotonic, monotonic, or homomorphic (which implies monotonic). So long as the program avoids non-monotonic methods, it can be realized without coordination. Moreover, morphisms can be implemented more efficiently than non-homomorphic monotonic methods. Bloom^L ships with a family of useful built-in semilattices like booleans ordered by implication, integers ordered by less than and greater than, sets, and maps. Users can also define their own semilattices, and Bloom^L allows smooth interoperability between Bloom collections and Bloom^L lattices. Bloom^L's semi-naive implementation is comparable to Bloom's semi-naive implementation.
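A rough sketch of the lattice idea in Python (Bloom^L itself is a Ruby DSL, so this is only illustrative): a max-of-integers lattice whose merge is associative, commutative, and idempotent, plus a monotone threshold map into the boolean lattice.

```python
class LMax:
    """Bounded join semilattice of integers ordered by <=; merge is max."""
    def __init__(self, value):
        self.value = value
    def merge(self, other):
        return LMax(max(self.value, other.value))
    def at_least(self, k):
        """Monotone map into the boolean lattice (ordered by implication):
        once it becomes True, further merges can never make it False."""
        return self.value >= k

a, b = LMax(3), LMax(7)
assert a.merge(b).value == b.merge(a).value == 7     # commutative
assert a.merge(a).value == 3                         # idempotent
assert not a.at_least(5) and a.merge(b).at_least(5)  # monotone under merge
```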

18 |

The paper also presents two case studies. First, they implement a key-value store as a map from keys to values annotated with vector clocks: a design inspired by Dynamo. They also implement a purely monotonic shopping cart using custom lattices.

19 |
20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /html/halevy2016goods.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Goods: Organizing Google's Datasets (2016)

15 |

Summary. In fear of fettering development and innovation, companies often allow engineers free rein to generate and analyze datasets at will. This often leads to unorganized data lakes: a ragtag collection of datasets from a diverse set of sources. Google Dataset Search (Goods) is a system which uses unobtrusive post-hoc metadata extraction and inference to organize Google's unorganized datasets and present curated dataset information, such as metadata and provenance, to engineers.

16 |

Building a system like Goods at Google scale presents many challenges.

17 |
    18 |
  • Scale. There are 26 billion datasets. 26 billion (with a b)!
  • 19 |
  • Variety. Data comes from a diverse set of sources (e.g. BigTable, Spanner, logs).
  • 20 |
  • Churn. Roughly 5% of the datasets are deleted every day, and datasets are created roughly as quickly as they are deleted.
  • 21 |
  • Uncertainty. Some metadata inference is approximate and speculative.
  • 22 |
  • Ranking. To facilitate useful dataset search, datasets have to be ranked by importance: a difficult heuristic-driven process.
  • 23 |
  • Semantics. Extracting the semantic content of a dataset is useful but challenging. For example consider a file of protos that doesn't reference the type of proto being stored.
  • 24 |
25 |

The Goods catalog is a BigTable keyed by dataset name where each row contains metadata including

26 |
    27 |
  • basic metadata like timestamps, owners, and access permissions;
  • 28 |
  • provenance showing the lineage of each dataset;
  • 29 |
  • schema;
  • 30 |
  • data summaries extracted from source code; and
  • 31 |
  • user provided annotations.
  • 32 |
33 |

Moreover, similar datasets or multiple versions of the same logical dataset are grouped together to form clusters. Metadata for one element of a cluster can be used as metadata for other elements of the cluster, greatly reducing the amount of metadata that needs to be computed. Data is clustered by timestamp, data center, machine, version, and UID, all of which are extracted from dataset paths (e.g. /foo/bar/montana/August01/foo.txt).
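As a hedged illustration of path-based clustering (the path convention and regular expression here are hypothetical, not Goods's actual rules), abstracting away the varying components of a dataset path yields a cluster key shared by versions of the same logical dataset:

```python
import re

# Hypothetical path convention: a date-like component distinguishes versions of
# the same logical dataset; abstracting it away yields the cluster key.
def cluster_key(path):
    parts = path.strip("/").split("/")[:-1]          # drop the file name
    return "/".join("<date>" if re.fullmatch(r"[A-Z][a-z]+\d{2}", p) else p
                    for p in parts)

paths = ["/foo/bar/montana/August01/foo.txt", "/foo/bar/montana/August02/foo.txt"]
print({cluster_key(p) for p in paths})   # {'foo/bar/montana/<date>'}
```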

34 |

In addition to storing dataset metadata, each row also stores status metadata: information about the completion status of various jobs which operate on the catalog. The numerous concurrently executing batch jobs use status metadata as a weak form of synchronization and dependency resolution, potentially deferring the processing of a row until another job has processed it.

35 |

The fault tolerance of these jobs is provided by a mix of job retries, BigTable's idempotent update semantics, and a watchdog that terminates divergent programs.

36 |

Finally, a two-phase garbage collector tombstones rows that satisfy a garbage collection predicate and removes them one day later if they still match the predicate. Batch jobs do not process tombstoned rows.

37 |

The Goods frontend includes dataset profile pages, dataset search driven by a handful of heuristics to rank datasets by importance, and team dashboards.

38 |
39 | 40 | 41 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /html/hindman2011mesos.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 | 17 | 18 | 19 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /html/kornacker2015impala.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Papers 5 | 6 | 7 | 8 | 9 | 10 | 13 |
14 |

Impala: A Modern, Open-Source SQL Engine for Hadoop (2015)

15 |

Summary. Impala is a distributed query engine built on top of Hadoop. That is, it builds off of existing Hadoop tools and frameworks and reads data stored in Hadoop file formats from HDFS.

16 |

Impala's CREATE TABLE commands specify the location and file format of data stored in Hadoop. This data can also be partitioned into different HDFS directories based on certain column values. Users can then issue typical SQL queries against the data. Impala supports batch INSERTs but doesn't support UPDATE or DELETE. Data can also be manipulated directly by going through HDFS.


Impala is divided into three components.

  1. An Impala daemon (impalad) runs on each machine and is responsible for receiving queries from users and for orchestrating the execution of queries.
  2. A single Statestore daemon (statestored) is a pub/sub system used to disseminate system metadata asynchronously to clients. The statestore has weak semantics and doesn't persist anything to disk.
  3. A single Catalog daemon (catalogd) publishes catalog information through the statestored. The catalogd pulls in metadata from external systems, puts it in Impala form, and pushes it through the statestored.

Impala has a Java frontend that performs the typical database frontend operations (e.g. parsing, semantic analysis, and query optimization). It uses a two-phase query planner.

  1. Single-node planning. First, a single-node, non-executable query plan tree is formed. Typical optimizations like join reordering are performed.
  2. Plan parallelization. After the single-node plan is formed, it is fragmented and divided between multiple nodes with the goal of minimizing data movement and maximizing scan locality.

Impala has a C++ backend that uses Volcano-style iterators with exchange operators and runtime code generation using LLVM. To efficiently read data from disk, Impala bypasses the traditional HDFS protocols. The backend supports a number of different file formats including Avro, RCFile, SequenceFile, plain text, and Parquet.


For cluster and resource management, Impala uses a home-grown system called Llama that sits over YARN.

-------------------------------------------------------------------------------- /html/lehman1999t.html: --------------------------------------------------------------------------------

T Spaces: The Next Wave (1999)


Summary. T Spaces is a


tuplespace-based network communication buffer with database capabilities that enables communication between applications and devices in a network of heterogeneous computers and operating systems


Essentially, it's Linda++; it implements a Linda tuplespace with a couple new operators and transactions.
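For a feel of the core operators, here is a toy in-memory tuplespace with Write, Read, Take, and Scan; blocking variants, Rhonda, indexing, and transactions are omitted, and the API is invented for illustration.

```python
class TupleSpace:
    """A toy tuplespace; None fields in a template act as wildcards."""

    def __init__(self):
        self.tuples = []

    def write(self, *tup):
        self.tuples.append(tup)

    def _matches(self, template, tup):
        return len(template) == len(tup) and \
            all(t is None or t == v for t, v in zip(template, tup))

    def read(self, *template):   # non-destructive match
        return next((t for t in self.tuples if self._matches(template, t)), None)

    def take(self, *template):   # destructive match
        t = self.read(*template)
        if t is not None:
            self.tuples.remove(t)
        return t

    def scan(self, *template):   # all matches
        return [t for t in self.tuples if self._matches(template, t)]

ts = TupleSpace()
ts.write("temperature", "kitchen", 21)
ts.write("temperature", "garage", 15)
print(ts.scan("temperature", None, None))   # both tuples
```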


The paper begins with a history of related tuplespace-based work. The notion of a shared collaborative space originated with the AI blackboard systems popular in the 1970s, the most famous of which was the Hearsay-II system. Later, the Stony Brook microcomputer Network (SBN), a cluster organized in a torus topology, was developed at Stony Brook, and Linda was invented to program it. Over time, the domain in which tuplespaces were popular shifted from parallel programming to distributed programming, and a huge number of Linda-like systems were implemented.


T Spaces is the marriage of tuplespaces, databases, and Java.

  • Tuplespaces provide a flexible communication model;
  • databases provide stability, durability, and advanced querying; and
  • Java provides portability and flexibility.

T Spaces implements a Linda tuplespace with a few improvements:

  • In addition to the traditional Read, Write, Take, WaitToRead, and WaitToTake operators, T Spaces also introduces a Scan/ConsumingScan operator to read/take all tuples matched by a query and a Rhonda operator to exchange tuples between processes.
  • Users can also dynamically register new operators, the implementation of which takes advantage of Java.
  • Fields of tuples are indexed by name, and tuples can be queried by named value. For example, the query (foo = 8) returns all tuples (of any type) with a field foo equal to 8. These indexes are similar to the inversions implemented in Phase 0 of System R.
  • Range queries are supported.
  • To avoid storing large values inside of tuples, file URLs can instead be stored, and T Spaces transparently handles locating and transferring the file contents.
  • T Spaces implements a group-based ACL form of authorization.
  • T Spaces supports transactions.

To evaluate the expressiveness and performance of T Spaces, the authors implement a collaborative web-crawling application, a web-search information delivery system, and a universal information appliance.

-------------------------------------------------------------------------------- /html/letia2009crdts.html: --------------------------------------------------------------------------------

CRDTs: Consistency without concurrency control (2009)


Overview. Concurrently updating distributed mutable data is a challenging problem that often requires expensive coordination. Commutative replicated data types (CRDTs) are data types with commutative update operators that can provide convergence without coordination. Moreover, non-trivial CRDTs exist; this paper presents Treedoc: an ordered set CRDT.


Ordered Set CRDT. An ordered set CRDT represents an ordered sequence of atoms. Atoms are associated with IDs with five properties: 1. Two replicas of the same atom have the same ID. 2. No two atoms have the same ID. 3. IDs are constant for the lifetime of an ordered set. 4. IDs are totally ordered. 5. The space of IDs is dense. That is, for all IDs P and F where P < F, there exists an ID N such that P < N < F.
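Property 5 is what makes coordination-free inserts possible: between any two IDs there is always room for another. A minimal sketch of density using rationals instead of Treedoc's tree paths:

```python
from fractions import Fraction

def id_between(p, f):
    """Return an ID strictly between p and f; rationals are dense, so this
    always succeeds (Treedoc uses paths in a binary tree instead)."""
    assert p < f
    return (p + f) / 2

left, right = Fraction(0), Fraction(1)
mid = id_between(left, right)        # 1/2
earlier = id_between(left, mid)      # 1/4: room for yet another insert
print(mid, earlier)
```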


The ordered set supports two operations:

  1. insert(ID, atom)
  2. delete(ID)

where atoms are ordered by their corresponding ID. Concretely, Treedoc is represented as a tree, and IDs are paths in the tree ordered by an infix traversal. Nodes in the tree can be major nodes and contain multiple mini-nodes where each mini-node is annotated with a totally ordered disambiguator unique to each node.


Deletes simply tombstone a node. Inserts work like a typical binary tree insertion. To keep the tree and IDs from growing too large, the tree can periodically be flattened: the tree is restructured into an array of nodes. This operation does not commute nicely with others, so a coordination protocol like 2PC is required.


Treedoc in the Large Scale. Requiring a coordination protocol for flattening doesn't scale and runs against the spirit of CRDTs. It also doesn't handle churn well. Instead, nodes can be partitioned into a set of core nodes and a set of nodes in the nebula. The core nodes coordinate while the nebula nodes lag behind. There are protocols for nebula nodes to catch up to the core nodes.

-------------------------------------------------------------------------------- /html/li2014automating.html: --------------------------------------------------------------------------------

Automating the Choice of Consistency Levels in Replicated Systems (2014)


Geo-replicated data stores have to choose between the performance overheads of implementing strong consistency and the mind-boggling semantics of weak consistency. Some data stores let users make this decision at a fine-grained level, allowing some operations to run with strong consistency while other operations run under weak consistency. For example, Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary introduced RedBlue consistency, in which users annotate operations as red (strong) or blue (weak). Typically, choosing the consistency level for each operation has been a manual task. This paper builds off of RedBlue consistency and automates the process of labeling operations as red or blue.


Overview. When using RedBlue consistency, users can decompose operations into generator operations and shadow operations to improve commutativity. This paper automates this process by converting Java applications that write state to a relational database to issue operations against CRDTs. Moreover, it uses a combination of static analysis and runtime checks to ensure that operations are invariant confluent.


Generating Shadow Operations. Relations are just sets of tuples, so we can model them as CRDT sets. Moreover, we can model individual fields as CRDTs. This paper presents field CRDTs (PN-Counters, LWW-registers) and set CRDTs to model relations. Users then annotate SQL create statements indicating which CRDT to use.
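As an example of a field CRDT, here is a minimal state-based PN-Counter sketch: each replica tracks its own increments and decrements, and merge takes a pointwise max, so merges commute and converge.

```python
class PNCounter:
    def __init__(self, replica_id, n_replicas):
        self.rid = replica_id
        self.p = [0] * n_replicas   # increments, per replica
        self.n = [0] * n_replicas   # decrements, per replica

    def increment(self, amount=1):
        self.p[self.rid] += amount

    def decrement(self, amount=1):
        self.n[self.rid] += amount

    def value(self):
        return sum(self.p) - sum(self.n)

    def merge(self, other):
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]

a, b = PNCounter(0, 2), PNCounter(1, 2)
a.increment(5)
b.decrement(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 3
```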


Using a custom JDBC driver, user operations can be converted into a corresponding shadow operation that issues CRDT operations.


Classification of Shadow Operations. We want to find the database states and transaction arguments that guarantee invariant confluence. This can be a very difficult (probably undecidable) problem in general. Instead, we simplify the problem by considering the possible traces of CRDT operations through a program. Even with this simplifying assumption, tracing execution through loops can be challenging. We can again simplify things by only considering loops whose iterations are independent of one another.


We model each program as a regular expression over statements. We then unroll the regular expression to get the set of all execution traces. Using these traces we can construct a map from traces, or templates, to weakest preconditions that ensure invariant confluence.


At runtime, we need only look up the weakest precondition given an execution trace and check that it is true.


Implementation. The system is implemented as 15K lines of Java and 533 lines of OCaml. It builds off of MySQL, an existing Java parser, and Gemini.

-------------------------------------------------------------------------------- /html/mckusick1984fast.html: --------------------------------------------------------------------------------

A Fast File System for UNIX (1984)


The Fast File System (FFS) improved the read and write throughput of the original Unix file system by 10x by

  1. increasing the block size,
  2. dividing blocks into fragments, and
  3. performing smarter allocation.

The original Unix file system, dubbed "the old file system", divided disk drives into partitions and loaded a file system onto each partition. The file system included a superblock containing metadata, a linked list of free data blocks known as the free list, and an inode for every file. Notably, the file system was composed of 512-byte blocks; no more than 512 bytes could be transferred from the disk at once. Moreover, the file system had poor data locality. Files were often sprayed across the disk, requiring lots of random disk accesses.


The "new file system" improved performance by increasing the block size to any power of two at least as big as 4096 bytes. In order to handle small files efficiently and avoid high internal fragmentation and wasted space, blocks were further divided into fragments at least as large as the disk sector size.

      +------------+------------+------------+------------+
block | fragment 1 | fragment 2 | fragment 3 | fragment 4 |
      +------------+------------+------------+------------+

Files would occupy as many complete blocks as possible before populating at most one fragmented block.
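A small sketch of the resulting space calculation, assuming (for illustration) a 4096-byte block and 1024-byte fragments:

```python
BLOCK = 4096
FRAGMENT = 1024

def allocation(file_size):
    """Return (full blocks, trailing fragments) for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, BLOCK)
    fragments = -(-remainder // FRAGMENT)   # ceiling division
    return full_blocks, fragments

print(allocation(5000))   # (1, 1): one full block plus one 1024-byte fragment
print(allocation(500))    # (0, 1): a small file wastes at most part of a fragment
```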


Data was also divided into cylinder groups where each cylinder group included a copy of the superblock, a list of inodes, a bitmap of available blocks (as opposed to a free list), some usage statistics, and finally data blocks. The file system took advantage of hardware-specific information to place data at rotational offsets specific to the hardware so that files could be read with as little delay as possible. Care was also taken to allocate files contiguously, to place similar files in the same cylinder group, and to keep all the inodes in a directory together. Moreover, if the amount of available space gets too low, it becomes more and more difficult to allocate blocks efficiently; for example, it becomes hard to allocate the blocks of a file contiguously. Thus, the system always tries to keep ~10% of the disk free.


Allocation is also improved in the FFS. A top-level global policy uses file-system-wide information to decide where to put new files. Then, a local policy places the blocks. Care must be taken to colocate blocks that are accessed together, but crowding a single cylinder group can exhaust its resources.


In addition to performance improvements, FFS also introduced

  1. longer filenames,
  2. advisory file locks,
  3. soft links,
  4. atomic file renaming, and
  5. disk quota enforcement.

-------------------------------------------------------------------------------- /html/moore2006inferring.html: --------------------------------------------------------------------------------

Inferring Internet Denial-of-Service Activity (2001)


Summary. This paper uses backscatter analysis to quantitatively analyze denial-of-service attacks on the Internet. Most flooding denial-of-service attacks involve IP spoofing, where each packet in an attack is given a faux source IP address drawn uniformly at random from the space of all IP addresses. If each packet elicits a reply from the victim, then victims of denial-of-service attacks end up sending unsolicited packets to hosts spread uniformly at random across the address space. By monitoring this backscatter at enough hosts, one can infer the number, intensity, and type of denial-of-service attacks.


There are, of course, a number of assumptions upon which backscatter analysis depends.

  1. Address uniformity. It is assumed that DoS attackers spoof IP addresses uniformly at random.
  2. Reliable delivery. It is assumed that packets, as part of the attack and response, are delivered reliably.
  3. Backscatter hypothesis. It is assumed that unsolicited packets arriving at a host are actually part of backscatter.

The paper performs a backscatter analysis on 1/256 of the IPv4 address space. They cluster the backscatter data using a flow-based classification to measure individual attacks and using an event-based classification to measure the intensity of attacks. The findings of the analysis are best summarized by the paper.
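Under the address-uniformity assumption, a monitor covering n of the 2^32 IPv4 addresses sees roughly an n/2^32 fraction of a victim's backscatter, so the observed rate can simply be scaled up. A quick sketch of that extrapolation (the observed numbers are made up):

```python
MONITORED = 2 ** 24   # a /8 network: 1/256 of the IPv4 address space

def estimated_attack_rate(observed_packets, seconds):
    """Scale the backscatter rate seen at the monitor up to the full address
    space (assumes uniform spoofing and reliable delivery)."""
    observed_rate = observed_packets / seconds
    return observed_rate * (2 ** 32 / MONITORED)

# e.g. 600 backscatter packets in one minute suggests the victim is sending
# roughly 2,560 response packets per second.
print(estimated_attack_rate(600, 60))   # 2560.0
```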

-------------------------------------------------------------------------------- /html/ritchie1978unix.html: --------------------------------------------------------------------------------

The Unix Time-Sharing System (1974)


Unix was an operating system developed by Dennis Ritchie, Ken Thompson, and others at Bell Labs. It was the successor to Multics and is probably the single most influential piece of software ever written.


Earlier versions of Unix were written in assembly, but the project was later ported to C: probably the single most influential programming language ever developed. This resulted in a 1/3 increase in size, but the code was much more readable and the system included new features, so it was deemed worth it.


The most important feature of Unix was its file system. Ordinary files were simple arrays of bytes physically stored as 512-byte blocks: a rather simple design. Each file was given an inumber: an index into an ilist of inodes. Each inode contained metadata about the file and pointers to the actual data of the file in the form of direct and indirect blocks. This representation made it easy to support (hard) linking. Each file was protected with 9 bits: the same protection model Linux uses today. Directories were themselves files which stored mappings from filenames to inumbers. Devices were modeled simply as files in the /dev directory. This unifying abstraction allowed devices to be accessed with the same API. File systems could be mounted using the mount command. Notably, Unix didn't support user level locking, as it was neither necessary nor sufficient.
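A toy sketch of the directory abstraction: directories are just tables from names to inumbers, so path lookup is repeated table lookup (the data below is invented for illustration).

```python
# inumber -> inode; a directory's data is a name -> inumber table.
ilist = {
    0: {"type": "dir",  "data": {"dev": 1, "usr": 2}},   # root directory
    1: {"type": "dir",  "data": {"tty": 3}},
    2: {"type": "dir",  "data": {"readme": 4}},
    3: {"type": "dev"},
    4: {"type": "file", "data": b"hello"},
}

def namei(path, root=0):
    """Resolve a path like /usr/readme to an inumber."""
    inumber = root
    for name in path.strip("/").split("/"):
        inumber = ilist[inumber]["data"][name]
    return inumber

print(namei("/usr/readme"))   # 4
print(namei("/dev/tty"))      # 3: devices resolve exactly like ordinary files
```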


Processes in Unix could be created using a fork followed by an exec, and processes could communicate with one another using pipes. The shell was nothing more than an ordinary process. Unix included file redirection, pipes, and the ability to run programs in the background. All this was implemented using fork, exec, wait, and pipes.
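As a rough illustration of how a shell wires this together, here is a minimal Python sketch (Unix-only, and obviously not the original C) that runs the equivalent of ls | wc -l using fork, exec, pipe, and wait.

```python
import os

read_fd, write_fd = os.pipe()

if os.fork() == 0:            # first child: ls
    os.dup2(write_fd, 1)      # stdout -> pipe
    os.close(read_fd)
    os.close(write_fd)
    os.execvp("ls", ["ls"])

if os.fork() == 0:            # second child: wc -l
    os.dup2(read_fd, 0)       # stdin <- pipe
    os.close(read_fd)
    os.close(write_fd)
    os.execvp("wc", ["wc", "-l"])

os.close(read_fd)             # the parent (the "shell") closes its copies
os.close(write_fd)
os.wait()
os.wait()
```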


Unix also supported signals.

-------------------------------------------------------------------------------- /html/saltzer1984end.html: --------------------------------------------------------------------------------

End-to-End Arguments in System Design (1984)


This paper presents the end-to-end argument:


The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)


which says that in a layered system, functionality should, nay must, be implemented as close to the application as possible to ensure correctness (and usually also performance).


The end-to-end argument is motivated by an example file transfer scenario in which host A transfers a file to host B. Every step of the file transfer presents an opportunity for failure. For example, the disk may silently corrupt data or the network may reorder or drop packets. Any attempt by one of these subsystems to ensure reliable delivery is wasted effort since the delivery may still fail in another subsystem. The only way to guarantee correctness is to have the file transfer application check for correct delivery itself. For example, once it receives the entire file, it can send the file's checksum back to host A to confirm correct delivery.
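A small sketch of the end-to-end check itself: hash what was received and compare it against a hash of what was sent (SHA-256 and the byte strings are arbitrary choices for illustration).

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

sent = b"the entire file, as read from host A's disk"
received = b"the entire file, as read from host A's disk"  # what host B stored

# Host B sends checksum(received) back to host A; A retries on mismatch.
if checksum(received) == checksum(sent):
    print("delivery confirmed end to end")
else:
    print("corruption somewhere along the way; retry the transfer")
```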


In addition to being necessary for correctness, applying the end-to-end argument usually also leads to improved performance. When functionality is implemented in a lower-level subsystem, every application built on it must pay its cost, even if it does not require the functionality.


There are numerous other examples of the end-to-end argument:

  • Guaranteed packet delivery.
  • Secure data transmission.
  • Duplicate message suppression.
  • FIFO delivery.
  • Transaction management.
  • RISC.

The end-to-end argument is not a hard and fast rule. In particular, it may be eschewed when implementing a functionality at a lower level can lead to performance improvements. Consider again the file transfer protocol above and assume the network drops one in every 100 packets. As the file becomes longer, the odds of a successful delivery become increasingly small, making it prohibitively expensive for the application alone to ensure reliable delivery. The network may be able to perform a small amount of work to help guarantee reliable delivery, making the file transfer more efficient.
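A quick back-of-the-envelope check of that claim: with independent 1% drops and no lower-level retransmission, the chance that every packet of a file arrives on the first try shrinks geometrically with file length.

```python
def p_all_delivered(num_packets, drop_rate=0.01):
    """Probability that every packet arrives, assuming independent drops."""
    return (1 - drop_rate) ** num_packets

for n in (10, 100, 1000, 10000):
    print(n, p_all_delivered(n))
# 10 packets: ~0.90, 100: ~0.37, 1000: ~4e-5, 10000: effectively zero
```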

-------------------------------------------------------------------------------- /html/vavilapalli2013apache.html: --------------------------------------------------------------------------------

Apache Hadoop YARN: Yet Another Resource Negotiator (2013)


Summary. Hadoop began as a MapReduce clone designed for large-scale web crawling. As big data became trendy and data became... big, Hadoop became the de facto standard data processing system, and large Hadoop clusters were installed in many companies as "the" cluster. As application requirements evolved, users started abusing these large Hadoop clusters in unintended ways. For example, users would submit map-only jobs which were thinly disguised web servers. Apache Hadoop YARN is a cluster manager that aims to disentangle cluster management from the programming paradigm and has the following goals:

  • Scalability
  • Multi-tenancy
  • Serviceability
  • Locality awareness
  • High cluster utilization
  • Reliability/availability
  • Secure and auditable operation
  • Support for programming model diversity
  • Flexible resource model
  • Backward compatibility

YARN is orchestrated by a per-cluster Resource Manager (RM) that tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants. An Application Master (AM) is responsible for negotiating resources with the RM and for managing the execution of a single job. AMs send ResourceRequests to the RM describing their resource requirements, locality preferences, etc. In return, the RM hands out containers (e.g. <2GB RAM, 1 CPU>) to AMs. The RM also communicates with Node Managers (NM) running on each node, which are responsible for measuring node resources and managing (i.e. starting and killing) tasks. When a user wants to submit a job, they send it to the RM, which hands a capability to an AM to present to an NM. The RM is a single point of failure. If it fails, it restores its state from disk and kills all running AMs. The AMs are trusted to be fault-tolerant and resubmit any prematurely terminated jobs.


YARN is deployed at Yahoo where it manages roughly 500,000 daily jobs. YARN supports frameworks like Hadoop, Tez, Spark, Dryad, Giraph, and Storm.

-------------------------------------------------------------------------------- /html/vogels2009eventually.html: --------------------------------------------------------------------------------

Eventually Consistent (2009)


In this CACM article, Werner Vogels discusses eventual consistency as well as a couple other forms of consistency.


Historical Perspective


Back in the day, when people were thinking about how to build distributed systems, they thought that strong consistency was the only option. Anything else would just be incorrect, right? Well, fast forward to the 90's and availability started being favored over consistency. In 2000, Brewer released the CAP theorem unto the world, and weak consistency really took off.


Client's Perspective of Consistency


There is a zoo of consistency models from the perspective of a client.

  • Strong consistency. A data store is strongly consistent if after a write completes, it is visible to all subsequent reads.
  • Weak consistency. Weak consistency is a synonym for no consistency. Your reads can return garbage.
  • Eventual consistency. "the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value."
  • Causal consistency. If A issues a write and then contacts B, B's read will see the effect of A's write since it causally comes after it. Causally unrelated reads can return whatever they want.
  • Read-your-writes consistency. Clients read their most recent write or a more recent version.
  • Session consistency. So long as a client maintains a session, it gets read-your-writes consistency.
  • Monotonic read consistency. Over time, a client will see increasingly fresh versions of data.
  • Monotonic write consistency. A client's writes will be executed in the order in which they are issued.

Server's Perspective of Consistency


Systems can implement consistency using quorums in which a write is sent to W of the N replicas, and a read is sent to R of the N replicas. If R + W > N, then we have strong consistency.
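A tiny sketch of the quorum arithmetic: a read quorum of size R and a write quorum of size W drawn from N replicas are guaranteed to overlap exactly when R + W > N, which is what forces reads to observe the latest write.

```python
def quorums_overlap(n, r, w):
    """True iff every read quorum must intersect every write quorum."""
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    kind = "strong" if quorums_overlap(n, r, w) else "weaker (may read stale data)"
    print(f"N={n} R={r} W={w}: {kind}")
```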

-------------------------------------------------------------------------------- /html/zaharia2012resilient.html: --------------------------------------------------------------------------------

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012)


Frameworks like MapReduce made processing large amounts of data easier, but they did not leverage distributed memory. If a MapReduce job was run iteratively, it would write all of its intermediate state to disk, which was prohibitively slow. This limitation made batch processing systems like MapReduce ill-suited to iterative (e.g. k-means clustering) and interactive (e.g. ad-hoc queries) workflows. Other systems like Pregel did take advantage of distributed memory and reused in-memory data across computations, but those systems were not general-purpose.


Spark uses Resilient Distributed Datasets (RDDs) to perform general computations in memory. RDDs are immutable partitioned collections of records. Unlike pure distributed shared memory abstractions which allow for arbitrary fine-grained writes, RDDs can only be constructed using coarse-grained transformations from on-disk data or other RDDs. This weaker abstraction can be implemented efficiently. Spark also uses RDD lineage to implement low-overhead fault tolerance. Rather than persist intermediate datasets, the lineage of an RDD can be persisted and efficiently recomputed. RDDs could also be checkpointed to avoid the recomputation of a long lineage graph.
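A toy sketch (nothing like Spark's real API) of the lineage idea: each dataset remembers only its parent and the coarse-grained transformation that produced it, so a lost partition can be rebuilt by recomputation rather than restored from a persisted copy.

```python
class ToyRDD:
    """A dataset defined by its lineage: a parent plus a transformation."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source, self.parent, self.transform = source, parent, transform

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        """Recompute from lineage; if data is lost, just call this again."""
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.compute())

base = ToyRDD(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.compute())   # [0, 4, 16, 36, 64]
```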


Spark has a Scala-integrated API and comes with a modified interactive interpreter. It also includes a large number of useful transformations (which construct RDDs) and actions (which derive data from RDDs). Users can also manually specify RDD persistence and partitioning to further improve performance.


Spark subsumed a huge number of existing data processing frameworks like MapReduce and Pregel in a small amount of code. It was also much, much faster than everything else on a large number of applications.

20 | 21 | 22 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /images/dan2017using_fork.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mwhittaker/papers/b37f7047b911a1abe9a52b85654d0ff8bcc10c0f/images/dan2017using_fork.png -------------------------------------------------------------------------------- /images/dan2017using_zigzag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mwhittaker/papers/b37f7047b911a1abe9a52b85654d0ff8bcc10c0f/images/dan2017using_zigzag.png -------------------------------------------------------------------------------- /js/mathjax_config.js: -------------------------------------------------------------------------------- 1 | window.MathJax = { 2 | tex2jax: { 3 | inlineMath: [['$','$'], ['\\(','\\)']], 4 | skipTags: ['script', 'noscript', 'style', 'textarea'], 5 | }, 6 | TeX: { 7 | Macros: { 8 | set: ["\\left\\{#1\\right\\}", 1], 9 | setst: ["\\left\\{#1 \\,\\middle|\\, #2\\right\\}", 2], 10 | parens: ["\\left(#1\\right)", 1], 11 | reals: "\\mathbb{R}", 12 | nats: "\\mathbb{N}", 13 | rats: "\\mathbb{Q}", 14 | ints: "\\mathbb{Z}", 15 | defeq: "\\triangleq", 16 | coleq: "\\texttt{:-}", 17 | norm: ["\\left\\lVert#1\\right\\rVert", 1], 18 | twonorm: ["\\norm{#1}_2", 1], 19 | }, 20 | }, 21 | }; 22 | -------------------------------------------------------------------------------- /papers/adya2000generalized.md: -------------------------------------------------------------------------------- 1 | ## [Generalized Isolation Level Definitions (2000)](TODO) ## 2 | **Summary.** 3 | In addition to serializability, ANSI SQL-92 defined a set of weaker isolation 4 | levels that applications could use to improve performance at the cost of 5 | consistency. The definitions were implementation-independent but ambiguous. 6 | Berenson et al. proposed a revision of the isolation level definitions that was 7 | unambiguous but specific to locking. Specifically, they define a set of 8 | *phenomena*: 9 | 10 | - P0: `w1(x) ... w2(x) ...` *"dirty write"* 11 | - P1: `w1(x) ... r2(x) ...` *"dirty read"* 12 | - P2: `r1(x) ... w2(x) ...` *"unrepeatable read"* 13 | - P3: `r1(P) ... w2(y in P) ...` *"phantom read"* 14 | 15 | and define the isolation levels according to which phenomena they preclude. 16 | This preclusion can be implemented by varying how long certain types of locks 17 | are held: 18 | 19 | | write locks | read locks | phantom locks | precluded | 20 | | ----------- | ---------- | ------------- | -------------- | 21 | | short | short | short | P0 | 22 | | long | short | short | P0, P1 | 23 | | long | long | short | P0, P1, P2 | 24 | | long | long | long | P0, P1, P2, P3 | 25 | 26 | This locking-specific *preventative* approach to defining isolation levels, 27 | while unambiguous, rules out many non-locking implementations of concurrency 28 | control. Notably, it does not allow for multiversioning and does not allow 29 | non-committed transactions to experience weaker consistency than committed 30 | transactions. Moreover, many isolation levels are naturally expressed as 31 | invariants between multiple objects, but these definitions are all over a 32 | single object. 33 | 34 | This paper introduces implementation-independent unambiguous isolation level 35 | definitions. The definitions also include notions of predicates at all levels. 
36 | It does so by first introducing the definition of a *history* as a partial 37 | order of read/write/commit/abort events and total order of commited object 38 | versions. It then introduces three dependencies: *read-dependencies*, 39 | *anti-dependencies*, and *write-dependencies* (also known as write-read, 40 | read-write, and write-write dependencies). Next, it describes how to construct 41 | dependency graph and defines isolation levels as constraints on these graphs. 42 | 43 | For example, the G0 phenomenon says that a dependency graph contains a 44 | write-dependency cycle. PL-1 is the isolation level that precludes G0. 45 | Similarly, the G1 phenomenon says that either 46 | 47 | 1. a committed transaction reads an aborted value, 48 | 2. a committed transaction reads an intermediate value, or 49 | 3. there is a write-read/write-write cycle. 50 | 51 | The PL-2 isolation level precludes G1 (and therefore G0) and corresponds 52 | roughly to the READ-COMMITTED isolation level. 53 | -------------------------------------------------------------------------------- /papers/agrawal1987concurrency.md: -------------------------------------------------------------------------------- 1 | # [Concurrency Control Performance Modeling: Alternatives and Implications (1987)](https://scholar.google.com/scholar?cluster=9784855600346107276&hl=en&as_sdt=0,5) 2 | ## Overview 3 | There are three types of concurrency control algorithms: locking algorithms, 4 | timestamp based algorithms, optimistic algorithms. There have been a large 5 | number of performance analyses aimed at deciding which type of concurrency 6 | algorithm is best, however despite the abundance of analyses, there is no 7 | definitive winner. Different analyses have contradictory results, largely 8 | because there is no standard performance model or set of assumptions. This 9 | paper presents a complete database model for evaluating the performance of 10 | concurrency control algorithms and discusses how varying assumptions affect the 11 | performance of various algorithms. 12 | 13 | ## Performance Model 14 | This paper analyzes three specific concurrency control mechanisms, 15 | 16 | - **Blocking.** Transactions acquire locks before they access a data item. 17 | Whenever a transaction acquires a lock, deadlock detection is run. If a 18 | deadlock is detected, the youngest transaction is aborted. 19 | - **Immediate-restart.** Again, transactions acquire locks, but instead of 20 | blocking if a lock cannot be immediately acquired, the transaction is instead 21 | aborted and restarted with delay. This delay is adjusted dynamically to be 22 | roughly equal to the average transaction duration. 23 | - **Optimistic.** Transactions do not acquire locks. A transaction is aborted 24 | when it goes to commit if it read any objects that had been written and 25 | committed since the transaction began. 26 | 27 | using a closed queueing model of a single-site database. Essentially, 28 | transactions come in, sit in some queues, and are controlled by a concurrency 29 | control algorithm. The model has a number of parameters, some of which are held 30 | constant for all the experiments and some of which are varied from experiment 31 | to experiment. Some of the parameters had to be tuned to get interesting 32 | result. For example, it was found that with a large database and few conflicts, 33 | all concurrency control algorithms performed roughly the same and scaled with 34 | the degree of parallelism. 
35 | 36 | ## Resource-Related Assumptions 37 | Some analyses assume infinite resources. How do these assumptions affect 38 | concurrency control performance? 39 | 40 | - **Experiment 1: Infinite Resources.** Given infinite resources, higher 41 | degrees of parallelism lead to higher likelihoods of transaction conflict 42 | which in turn leads to higher likelihoods of transaction abort and restart. 43 | The blocking algorithm thrashes because of these increased conflicts. The 44 | immediate-restart algorithm plateaus because the dynamic delay effectively 45 | limits the amount of parallelism. The optimistic algorithm does well because 46 | aborted transactions are immediately replaced with other transactions. 47 | - **Experiment 2: Limited Resources.** With a limited number of resources, all 48 | three algorithms thrash, but blocking performs best. 49 | - **Experiment 3: Multiple Resources.** The blocking algorithm performs best up 50 | to about 25 resource units (i.e. 25 CPUs and 50 disks); after that, the 51 | optimistic algorithm performs best. 52 | - **Experiment 4: Interactive Workloads.** When transactions spend more time 53 | "thinking", the system begins to behave more like it has infinite resources 54 | and the optimistic algorithm performs best. 55 | 56 | ## Transaction Behavior Assumptions 57 | 58 | - **Experiment 6: Modeling Restarts.** Some analyses model a transaction 59 | restart as the spawning of a completely new transaction. These fake restarts 60 | lead to higher throughput because they avoid repeated transaction conflict. 61 | - **Experiment 7: Write-Lock Acquisition.** Some analyses have transactions 62 | acquire read-locks and then upgrade them to write-locks. Others have 63 | transactions immediately acquire write-locks if the object will ever be 64 | written to. Upgrading locks can lead to deadlock if two transactions 65 | concurrently write to the same object. The effect of lock upgrading varies 66 | with the amount of available resources. 67 | -------------------------------------------------------------------------------- /papers/alvaro2010boom.md: -------------------------------------------------------------------------------- 1 | ## [BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud (2010)](TODO) ## 2 | **Summary.** 3 | Programming distributed systems is hard. Really hard. This paper conjectures 4 | that *data-centric* programming done with *declarative programming languages* 5 | can lead to simpler distributed systems that are more correct with less code. 6 | To support this conjecture, the authors implement an HDFS and Hadoop clone in 7 | Overlog, dubbed BOOM-FS and BOOM-MR respectively, using orders of magnitude 8 | fewer lines of code that the original implementations. They also extend BOOM-FS 9 | with increased availability, scalability, and monitoring. 10 | 11 | An HDFS cluster consists of NameNodes responsible for metadata management and 12 | DataNodes responsible for data management. BOOM-FS reimplements the metadata 13 | protocol in Overlog; the data protocol is implemented in Java. The 14 | implementation models the entire system state (e.g. files, file paths, 15 | heartbeats, etc.) as data in a unified way by storing them as collections. The 16 | Overlog implementation of the NameNode then operates on and updates these 17 | collections. Some of the data (e.g. file paths) are actually views that can be 18 | optionally materialized and incrementally maintained. 
After reaching (almost) 19 | feature parity with HDFS, the authors increased the availability of the 20 | NameNode by using Paxos to introduce a hot standby replicas. They also 21 | partition the NameNode metadata to increase scalability and use metaprogramming 22 | to implement monitoring. 23 | 24 | BOOM-MR plugs in to the existing Hadoop code but reimplements two MapReduce 25 | scheduling algorithms: Hadoop's first-come first-server algorithm, and 26 | Zaharia's LATE policy. 27 | 28 | BOOM Analytics was implemented in order of magnitude fewer lines of code thanks 29 | to the data-centric approach and the declarative programming language. The 30 | implementation is also almost as fast as the systems they copy. 31 | -------------------------------------------------------------------------------- /papers/alvaro2011consistency.md: -------------------------------------------------------------------------------- 1 | ## [Consistency Analysis in Bloom: a CALM and Collected Approach (2011)](https://scholar.google.com/scholar?cluster=9165311711752272482&hl=en&as_sdt=0,5) ## 2 | **Summary.** 3 | Strong consistency eases reasoning about distributed systems, but it requires 4 | coordination which entails higher latency and unavailability. Adopting weaker 5 | consistency models can improve system performance but writing applications 6 | against these weaker guarantees can be nuanced, and programmers don't have any 7 | reasoning tools at their disposal. This paper introduces the *CALM conjecture* 8 | and *Bloom*, a disorderly declarative programming language based on CALM, which 9 | allows users to write loosely consistent systems in a more principled way. 10 | 11 | Say we've eschewed strong consistency, how do we know our program is even 12 | eventually consistent? For example, consider a distributed register in which 13 | servers accept writes on a first come first serve basis. Two clients could 14 | simultaneously write two distinct values `x` and `y` concurrently. One server 15 | may receive `x` then `y`; the other `y` then `x`. This system is not eventually 16 | consistent. Even after client requests have quiesced, the distributed register 17 | is in an inconsistent state. The *CALM conjecture* embraces Consistency As 18 | Logical Monotonicity and says that logically monotonic programs are eventually 19 | consistent for any ordering and interleaving of message delivery and 20 | computation. Moreover, they do not require waiting or coordination to stream 21 | results to clients. 22 | 23 | *Bud* (Bloom under development) is the realization of the CALM theorem. It is 24 | Bloom implemented as a Ruby DSL. Users declare a set of Bloom collections (e.g. 25 | persistent tables, temporary tables, channels) and an order independent set of 26 | declarative statements similar to Datalog. Viewing Bloom through the lens of 27 | its procedural semantics, Bloom execution proceeds in timesteps, and each 28 | timestep is divided into three phases. First, external messages are delivered 29 | to a node. Second, the Bloom statements are evaluated. Third, messages are sent 30 | off elsewhere. Bloom also supports modules and interfaces to improve 31 | modularity. 32 | 33 | The paper also implements a key-value store and shopping cart using Bloom and 34 | uses various visualization tools to guide the design of coordination-free 35 | implementations. 
36 | -------------------------------------------------------------------------------- /papers/armbrust2015spark.md: -------------------------------------------------------------------------------- 1 | ## [Spark SQL: Relational Data Processing in Spark (2015)](https://scholar.google.com/scholar?cluster=12543149035101013955&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Data processing frameworks like MapReduce and Spark can do things that 4 | relational databases can't do very easily. For example, they can operate over 5 | semi-structured or unstructured data, and they can perform advanced analytics. 6 | On the other hand, Spark's API allows user to run arbitrary code (e.g. 7 | `rdd.map(some_arbitrary_function)`) which prevents Spark from performing 8 | certain optimizations. Spark SQL marries imperative Spark-like data processing 9 | with declarative SQL-like data processing into a single unified interface. 10 | 11 | Spark's main abstraction was an RDD. Spark SQL's main abstraction is a 12 | *DataFrame*: the Spark analog of a table which supports a nested data model of 13 | standard SQL types as well as structs, arrays, maps, unions, and user defined 14 | types. DataFrames can be manipulated as if they were RDDs of row objects (e.g. 15 | `dataframe.map(row_func)`), but they also support a set of standard relational 16 | operators which take ASTs, built using a DSL, as arguments. For example, the 17 | code `users.where(users("age") < 40)` constructs an AST from `users("age") < 18 | 40` as an argument to filter the `users` DataFrame. By passing in ASTs as 19 | arguments rather than arbitrary user code, Spark is able to perform 20 | optimizations it previously could not do. DataFrames can also be queries using 21 | SQL. 22 | 23 | Notably, integrating queries into an existing programming language (e.g. Scala) 24 | makes writing queries much easier. Intermediate subqueries can be reused, 25 | queries can be constructed using standard control flow, etc. Moreover, Spark 26 | eagerly typechecks queries even though their execution is lazy. Furthermore, 27 | Spark SQL allows users to create DataFrames of language objects (e.g. Scala 28 | objects), and UDFs are just normal Scala functions. 29 | 30 | DataFrame queries are optimized and manipulated by a new extensible query 31 | optimizer called *Catalyst*. The query optimizer manipulates ASTs written in 32 | Scala using *rules*, which are just functions from trees to trees that 33 | typically use pattern matching. Queries are optimized in four phases: 34 | 35 | 1. *Analysis.* First, relations and columns are resolved, queries are 36 | typechecked, etc. 37 | 2. *Logical optimization.* Typical logical optimizations like constant folding, 38 | filter pushdown, boolean expression simplification, etc are performed. 39 | 3. *Physical planning.* Cost based optimization is performed. 40 | 4. *Code generation.* Scala quasiquoting is used for code generation. 41 | 42 | Catalyst also makes it easy for people to add new data sources and user defined 43 | types. 44 | 45 | Spark SQL also supports schema inference, ML integration, and query federation: 46 | useful features for big data. 
47 | 48 | -------------------------------------------------------------------------------- /papers/bailis2013highly.md: -------------------------------------------------------------------------------- 1 | ## [Highly Available Transactions: Virtues and Limitations (2014)](TODO) ## 2 | **Summary.** 3 | Serializability is the gold standard of consistency, but databases have always 4 | provided weaker consistency modes (e.g. Read Committed, Repeatable Read) that 5 | promise improved performance. In this paper, Bailis et al. determine which of 6 | these weaker consistency models can be implemented with high availability. 7 | 8 | First, why is high availability important? 9 | 10 | 1. *Partitions.* Partitions happen, and when they do non-available systems 11 | become, well, unavailable. 12 | 2. *Latency.* Partitions may be transient, but latency is forever. Highly 13 | available systems can avoid latency by eschewing coordination costs. 14 | 15 | Second, are weaker consistency models consistent enough? In short, yeah 16 | probably. In a survey of databases, Bailis finds that many do not employ 17 | serializability by default and some do not even provide full serializability. 18 | Bailis also finds that four of the five transactions in the TPC-C benchmark can 19 | be implemented with highly available transactions. 20 | 21 | After defining availability, Bailis presents the taxonomy of which consistency 22 | can be implemented as HATs, and also argues why some fundamentally cannot. He 23 | also performs benchmarks on AWS to show the performance benefits of HAT. 24 | -------------------------------------------------------------------------------- /papers/bailis2014coordination.md: -------------------------------------------------------------------------------- 1 | ## [Coordination Avoidance in Database Systems (2014)](https://scholar.google.com/scholar?cluster=428435405994413003&hl=en&as_sdt=0,5) 2 | **Overivew.** 3 | Coordination in a distributed system is sometimes necessary to maintain 4 | application correctness, or *consistency*. For example, a payroll application 5 | may require that each employee has a unique ID, or that a jobs relation only 6 | include valid employees. However, coordination is not cheap. It increases 7 | latency, and in the face of partitions can lead to unavailability. Thus, when 8 | application correctness permits, coordination should be avoided. This paper 9 | develops the necessary and sufficient conditions for when coordination is 10 | needed to maintain a set of database invariance using a notion of 11 | invariant-confluence or I-confluence. 12 | 13 | **System Model.** 14 | A *database state* is a set D of object versions drawn from the set of all 15 | states *D*. Transactions operate on *logical replicas* in *D* that contain the 16 | set of object versions relevant to the transaction. Transactions are modeled as 17 | functions T : *D* -> *D*. The effects of a transaction are merged into an 18 | existing replica using an associative, commutative, idempotent merge operator. 19 | Changes are shared between replicas and merges likewise. In this paper, merge 20 | is set union, and we assume we know all transactions in advance. Invariants are 21 | modeled as boolean functions I: *D* -> 2. A state R is said to be I-valid if 22 | I(R) is true. 23 | 24 | We say a system has *transactional availability* if whenever a transaction T 25 | can contact servers with the appropriate data in T, it only aborts if T chooses 26 | to abort. 
We say a system is *convergent* if after updates quiesce, all servers 27 | eventually have the same state. A system is globally *I-valid* if all replicas 28 | always have I-valid states. A system provides coordination-free execution if 29 | execution of a given transaction does not depend on the execution of others. 30 | 31 | **Consistency Sans Coordination.** 32 | A state Si is *I-T-Reachable* if its derivable from I-valid states with 33 | transactions in T. A set of transactions is *I*-confluent with respect to 34 | invariant I if for all I-T-Reachable states Di, Dj with common ancestor Di join 35 | Dj is I-valid. A globally I-valid system can execute a set of transactions T 36 | with global validity, transactional availability, convergence, and 37 | coordination-freedom if and only if T is I-confluent with respect to I. 38 | 39 | **Applying Invariant-Confluence.** 40 | I-confluence can be applied to existing relation operators and constraints. For 41 | example, updates, inserts, and deletes are I-confluent with respect to 42 | per-record inequality constraint. Deletions are I-confluent with respect to 43 | foreign key constraints; additions and updates are not. I-confluence can also 44 | be applied to abstract data types like counters. 45 | 46 | -------------------------------------------------------------------------------- /papers/barham2003xen.md: -------------------------------------------------------------------------------- 1 | ## [Xen and the Art of Virtualization (2003)](https://scholar.google.com/scholar?cluster=11605682627859750448&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Many virtual machine monitors, or *hypervisors*, aim to run unmodified guest 4 | operating systems by presenting a completely virtual machine. This lets any OS 5 | run on the hypervisor but comes with a significant performance penalty. Xen is 6 | an x86 hypervisor that uses *paravirtualization* to reduce virtualization 7 | overheads. Unlike with full virtualization, paravirtualization only virtualizes 8 | some components of the underlying machine. This paravirtualization requires 9 | modifications to the guest operating systems but not the applications running 10 | on it. Essentially, Xen sacrifices the ability to run unmodified guest 11 | operating systems for improved performance. 12 | 13 | There are three components that need to be paravirtualized: 14 | 15 | - *Memory management.* Software page tables and tagged page tables are easier 16 | to virtualize. Unfortunately, x86 has neither. Xen paravirtualizes the 17 | hardware accessible page tables leaving guest operating systems responsible 18 | for managing them. Page table modifications are checked by Xen. 19 | - *CPU.* Xen takes advantage of x86's four privileges, called *rings*. Xen runs 20 | at ring 0 (the most privileged ring), the guest OS runs at ring 1, and the 21 | applications running in the guest operating systems run at ring 3. 22 | - *Device I/O.* Guest operating systems communicate with Xen via bounded 23 | circular producer/consumer buffers. Xen communicates to guest operating 24 | systems using asynchronous notifications. 25 | 26 | The Xen hypervisor implements mechanisms. Policy is delegated to a privileged 27 | domain called dom0 that has accessed to privileges that other domains don't. 28 | 29 | Finally, a look at some details about Xen: 30 | 31 | - *Control transfer.* Guest operating systems request services from the 32 | hypervisor via *hypercalls*. 
Hypercalls are like system calls except they are 33 | between a guest operating system and a hypervisor rather than between an 34 | application and an operating system. Furthermore, each guest OS registers 35 | interrupt handlers with Xen. When an event occurs, Xen toggles a bitmask to 36 | indicate the type of event before invoking the registered handler. 37 | - *Data transfer.* As mentioned earlier, data transfer is performed using a 38 | bounded circular buffer of I/O descriptors. Requests and responses are pushed 39 | on to the buffer. Requests can come out of order with respect to the 40 | requests. Moreover, requests and responses can be batched. 41 | - *CPU Scheduling.* Xen uses the BVT scheduling algorithm. 42 | - *Time and timers.* Xen supports real time (the time in nanoseconds from 43 | machine boot), virtual time (time that only increases when a domain is 44 | executing), and clock time (an offset added to the real time). 45 | - *Virtual address translation.* Other hypervisors present a virtual contiguous 46 | physical address space on top of the actual hardware address space. The 47 | hypervisor maintains a shadow page table mapping physical addresses to 48 | hardware addresses and installs real page tables into the MMU. This has high 49 | overhead. Xen takes an alternate approach. Guest operating systems issue 50 | hypercalls to manage page table entries that are directly inserted into the 51 | MMU's page table. After they are installed, the page table entries are 52 | read-only. 53 | - *Physical memory.* Memory is physically partitioned into reservations. 54 | - *Network.* Xen provides virtual firewall-routers with one or more virtual 55 | network interfaces, each with a circular ring buffer. 56 | - *Disk.* Xen presents virtual block devices each with a ring buffer. 57 | -------------------------------------------------------------------------------- /papers/baumann2015shielding.md: -------------------------------------------------------------------------------- 1 | ## [Shielding Applications from an Untrusted Cloud with Haven (2014)](https://scholar.google.com/scholar?cluster=12325554201123386346&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | When running an application in the cloud, users have to trust (i) the cloud 4 | provider's software, (ii) the cloud provider's staff, and (iii) law enforcement 5 | with the ability to access user data. Intel SGX partially solves this problem 6 | by allowing users to run small portions of program on remote servers with 7 | guarantees of confidentiality and integrity. Haven leverages SGX and Drawbridge 8 | to run *entire legacy programs* with shielded execution. 9 | 10 | Haven assumes a very strong adversary which has access to all the system's 11 | software and most of the system's hardware. Only the processor and SGX hardware 12 | is trusted. Haven provides confidentiality and integrity, but not availability. 13 | It also does not prevent side-channel attacks. 14 | 15 | There are two main challenges that Haven's design addresses. First, most 16 | programs are written assuming a benevolent host. This leads to Iago attacks in 17 | which the OS subverts the application by exploiting its assumptions about the 18 | OS. Haven must operate correctly despite a *malicious host*. To do so, Haven 19 | uses a library operation system LibOS that is part of a Windows sandboxing 20 | framework called Drawbridge. LibOS implements a full OS API using only a few 21 | core host OS primitives. These core host OS primitives are used in a defensive 22 | way. 
A shield module sits below LibOS and takes great care to ensure that LibOS 23 | is not susceptible to Iago attacks. The user's application, LibOS, and the 24 | shield module are all run in an SGX enclave. 25 | 26 | Second, Haven aims to run *unmodified* binaries which were not written with 27 | knowledge of SGX. Real world applications allocate memory, load and run code 28 | dynamically, etc. Many of these things are not supported by SGX, so Haven (a) 29 | emulated them and (b) got the SGX specification revised to address them. 30 | 31 | Haven also implements an in-enclave encrypted file system in which only the 32 | root and leaf pages need to be written to stable storage. As of publication, 33 | however, Haven did not fully implement this feature. Haven is susceptible to 34 | replay attacks. 35 | 36 | Haven was evaluated by running Microsoft SQL Server and Apache HTTP Server. 37 | 38 | -------------------------------------------------------------------------------- /papers/bershad1995spin.md: -------------------------------------------------------------------------------- 1 | ## [SPIN -- An Extensible Microkernel for Application-specific Operating System Services (1995)](https://scholar.google.com/scholar?cluster=4910839957971330989&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Many operating systems were built a long time ago, and their performance was 4 | tailored to the applications and workloads at the time. More recent 5 | applications, like databases and multimedia applications, are quite different 6 | than these applications and can perform quite poorly on existing operating 7 | systems. SPIN is an extensible microkernel that allows applications to tailor 8 | the operating system to meet their needs. 9 | 10 | Existing operating systems fit into one of three categories: 11 | 12 | 1. They have no interface by which applications can modify kernel behavior. 13 | 2. They have a clean interface applications can use to modify kernel behavior 14 | but the implementation of the interface is inefficient. 15 | 3. They have an unconstrained interface that is efficiently implemented but 16 | does not provide isolation between applications. 17 | 18 | SPIN provides applications a way to efficiently and safely modify the behavior 19 | of the kernel. Programs in SPIN are divided into the user-level code and a 20 | spindle: a portion of user code that is dynamically installed and run in the 21 | kernel. The kernel provides a set of abstractions for physical and logical 22 | resources, and the spindles are responsible for managing these resources. The 23 | spindles can also register to be invoked when certain kernel events (i.e. page 24 | faults) occur. Installing spindles directly into the kernel provides 25 | efficiency. Applications can execute code in the kernel without the need for a 26 | context switch. 27 | 28 | To ensure safety, spindles are written in a typed object-oriented language. 29 | Each spindle is like an object; it contains local state and a set of methods. 30 | Some of these methods can be called by the application, and some are registered 31 | as callbacks in the kernel. A spindle checker uses a combination of static 32 | analysis and runtime checks to ensure that the spindles meet certain kernel 33 | invariants. Moreover, SPIN relies on advanced compiler technology to ensure 34 | efficient spindle compilation. 
35 | 36 | General purpose high-performance computing, parallel processing, multimedia 37 | applications, databases, and information retrieval systems can benefit from the 38 | application-specific services provided by SPIN. Using techniques such as 39 | 40 | - extensible IPC; 41 | - application-level protocol processing; 42 | - fast, simple communication; 43 | - application-specific file systems and buffer cache management; 44 | - user-level scheduling; 45 | - optimistic transactions; 46 | - real-time scheduling policies; 47 | - application-specific virtual memory; and 48 | - runtime systems with memory system feedback, 49 | 50 | applications can be implemented more efficiently on SPIN than on traditional 51 | operating systems. 52 | 53 | -------------------------------------------------------------------------------- /papers/brewer2012cap.md: -------------------------------------------------------------------------------- 1 | # [CAP Twelve Years Later: How the "Rules" Have Changed (2012)](https://scholar.google.com/scholar?cluster=17642052422667212790) 2 | The CAP theorem dictates that in the face of network partitions, replicated 3 | data stores must choose between high availability and strong consistency. In 4 | this 12 year retrospective, Eric Brewer takes a look back at the CAP theorem 5 | and provides some insights. 6 | 7 | ## Why 2 of 3 is Misleading 8 | The CAP theorem is misleading for three reasons. 9 | 10 | 1. Partitions are rare, and when a system is not partitioned, the system can 11 | have both strong consistency and high availability. 12 | 2. Consistency and availability can vary by subsystem or even by operation. The 13 | granularity of consistency and availability does not have to be an entire 14 | system. 15 | 3. There are various degrees of consistency and various degrees of availability. 16 | 17 | ## CAP-Latency Connection 18 | After a node experiences a delay when communicating with another node, it has 19 | to make a choice between (a) aborting the operation and sacrificing availability 20 | or (b) continuing with the operation anyway and sacrificing consistency. 21 | Essentially, a partition is a time bound on communication. Viewing partitions 22 | like this leads to three insights: 23 | 24 | 1. There is no global notion of a partition. 25 | 2. Nodes can detect partitions and enter a special partition mode. 26 | 3. Users can vary the time after which they consider the system partitioned. 27 | 28 | ## Managing Partitions 29 | Systems should take three steps to handle partitions. 30 | 31 | 1. Detect the partition. 32 | 2. Enter a special partition mode in which nodes either (a) limit the 33 | operations which can proceed thereby decreasing availability or (b) continue 34 | with the operations decreasing consistency, making sure to log enough 35 | information to recover after the partition. 36 | 3. Recover from the partition once communication resumes. 37 | 38 | ## Which Operations Should Proceed 39 | The operations which a node permits during a partition depend on the 40 | invariants it is willing to sacrifice. For example, nodes may temporarily 41 | violate unique id constraints during a partition since they are easy to detect 42 | and resolve. Other invariants are too important to violate, so operations that 43 | could potentially violate them are stalled. 44 | 45 | ## Partition Recovery 46 | Once a system recovers from a partition it has to 47 | 48 | 1. make the state consistent again, and 49 | 2. compensate for any mistakes made during the partition.
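The following minimal sketch (Python; the class and method names are hypothetical, not from the paper) illustrates the bookkeeping a replica might do across these steps: while partitioned it stalls operations that could violate important invariants, logs the rest, and after the partition merges the peer's log and issues compensations for detected conflicts.

```python
class Replica:
    """A toy replica following the detect / partition-mode / recover steps."""
    def __init__(self):
        self.state = {}            # key -> value
        self.partitioned = False   # set to True when a partition is detected
        self.log = []              # operations applied while partitioned

    def write(self, key, value, risky=False):
        if self.partitioned and risky:
            # (a) limit risky operations, decreasing availability
            raise RuntimeError("stalled until the partition heals")
        if self.partitioned:
            # (b) proceed, but log enough information to recover later
            self.log.append((key, value))
        self.state[key] = value

    def recover(self, peer_log, compensate):
        # Merge the peer's partition-time writes; conflicting writes trigger a
        # compensating action (e.g. emailing a user) rather than silent loss.
        mine = dict(self.log)
        for key, value in peer_log:
            if key in mine and mine[key] != value:
                compensate(key, self.state[key], value)
            else:
                self.state[key] = value
        self.partitioned, self.log = False, []
```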
50 | 51 | Sometimes a system is unable to automatically make the state consistent and 52 | depends on manual intervention. Sometimes, the system can automatically restore 53 | the state because it carefully rejected some operations during the partition. 54 | Other systems can automatically restore consistency because they use clever 55 | data structures like CRDTs. 56 | 57 | Some systems, especially those which externalize actions (e.g. ATMs), must 58 | sometimes issue compensations (e.g. emailing users). 59 | -------------------------------------------------------------------------------- /papers/brin1998anatomy.md: -------------------------------------------------------------------------------- 1 | # [The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)](https://scholar.google.com/scholar?cluster=9820961755208603037) 2 | In this paper, Larry and Sergey present Google. 3 | 4 | ## Features 5 | One of Google's main features is its use of PageRank to rank search results. 6 | For an arbitrary site $A$ and a set of sites $T_1, \ldots, T_n$ that link to 7 | it, the PageRank function satisfies the following equation: 8 | 9 | $$ 10 | PageRank(A) = (1 - d) + d \sum_{i=1}^n \frac{PageRank(T_i)}{OutDegree(T_i)} 11 | $$ 12 | 13 | where $d$ is a damping factor. Intuitively, sites get a high PageRank if they 14 | are frequently linked to or linked to by other sites with high PageRank. 15 | 16 | Google also associates anchor text (the text part of a link) with the page 17 | being linked to instead of the page with the link. This anchor text provides 18 | information about a site which the site itself may not include. 19 | 20 | Google also provides a number of other miscellaneous features. For example, it 21 | scores words in documents based on their size and boldness. It also maintains a 22 | repository of HTML that researchers can use as a dataset. 23 | 24 | ## Related Work 25 | Most related work is in the field of information retrieval, but this literature 26 | typically assumes that information is being extracted from a small set of 27 | homogeneous documents. The web is huge and varied, and users can even 28 | maliciously try to affect search engine results. 29 | 30 | ## System Anatomy 31 | A URL Server sends URLs to a set of crawlers which download sites and send them 32 | to a store server which compresses them and stores them in a repository. An 33 | indexer uses the repository to build a forward index (stored in a set of 34 | barrels) which maps documents to a set of hits. Each hit records a word in the 35 | document, its position in the document, its boldness, etc. The indexer also 36 | stores all anchors in a file which a URL resolver uses to build a graph for 37 | PageRank. A sorter periodically converts forward indexes into inverted indexes. 38 | 39 | ## Data Structures 40 | - **BigFiles.** Big files are really big virtual files that map over multiple file systems. 41 | - **Repository.** The repository contains the HTML of crawled sites compressed 42 | with zlib. The repository contains a record for each site which contains its 43 | DocId, URL, length, and HTML contents. 44 | - **Document index.** The index is an ISAM index from DocId to pointers into 45 | the repository. It also contains pointers to the URL and title of the site. 46 | Additionally, there is a hash map which maps URLs to DocIds. 47 | - **Lexicon.** The lexicon is an in-memory list of roughly 14 million words. 48 | The words are stored contiguously with nulls separating them. 
49 | - **Hit lists.** Each hit needs to compactly represent the font size, 50 | capitalization, position, etc. of a word. Google uses some custom bit tricks 51 | to do this with very few bits. 52 | - **Forward index.** The forward index contains a list of DocIds followed by a 53 | WordId and a series of hits. The index is partitioned by WordId across a 54 | number of barrels. 55 | - **Inverted index.** The inverted index maps words to DocIds. 56 | 57 | ## Crawling the Web 58 | The URL server and the crawlers are implemented in Python. Each crawler has 59 | about 300 open connections, performs asynchronous I/O with an event loop, and maintains 60 | a local DNS cache to boost performance. Crawling is the most fragile part of 61 | the entire search engine and requires a lot of corner case handling. 62 | 63 | ## Indexing 64 | Google implements a custom parser to handle exotic HTML. The HTML is converted 65 | into a set of hits using the lexicon. New words not in the lexicon are logged 66 | and later added to the lexicon. 67 | 68 | ## Search 69 | A search query is converted into a set of WordIds. The index is searched to 70 | find a set of documents that contain the words. The results are ranked by a 71 | number of factors (e.g. PageRank, proximity of the query terms in multi-word searches) before being 72 | returned to the user. 73 | 74 | 77 | -------------------------------------------------------------------------------- /papers/bugnion1997disco.md: -------------------------------------------------------------------------------- 1 | ## [Disco: Running Commodity Operating Systems on Scalable Multiprocessors (1997)](https://scholar.google.com/scholar?cluster=17298410582406300869&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Operating systems are complex, million-line code bases. Multiprocessors were 4 | becoming popular, but it was too difficult to modify existing commercial 5 | operating systems to take full advantage of the new hardware. Disco is a 6 | *virtual machine monitor*, or *hypervisor*, that uses virtualization to run 7 | commodity operating systems in virtual machines on cache-coherent NUMA multiprocessors. Guest 8 | operating systems running on Disco are only slightly modified, yet are still 9 | able to take advantage of the multiprocessor. Moreover, Disco offers all the 10 | traditional benefits of a hypervisor (e.g. fault isolation). 11 | 12 | Disco provides the following interfaces: 13 | 14 | - *Processors.* Disco provides full virtualization of the CPU allowing for 15 | restricted direct execution. Some privileged registers are mapped to memory 16 | to allow guest operating systems to read them. 17 | - *Physical memory.* Disco virtualizes the guest operating system's physical 18 | address spaces, mapping them to hardware addresses. It also supports page 19 | migration and replication to alleviate the non-uniformity of a NUMA machine. 20 | - *I/O devices.* All I/O communication is simulated, and various virtual disks 21 | are used. 22 | 23 | Disco is implemented as follows: 24 | 25 | - *Virtual CPUs.* Disco maintains the equivalent of a process table entry for 26 | each guest operating system. Disco runs in kernel mode, guest operating 27 | systems run in supervisor mode, and applications run in user mode. 28 | - *Virtual physical memory.* To avoid the overhead of physical to hardware 29 | address translation, Disco maintains a large software physical to hardware 30 | TLB. 31 | - *NUMA.* Disco migrates pages to the CPUs that access them frequently, and 32 | replicates read-only pages to the CPUs that read them frequently.
This 33 | dynamic page migration and replication help mask the non-uniformity of a 34 | NUMA machine. 35 | - *Copy-on-write disks.* Disco can map physical addresses in different guest 36 | operating systems to a read-only page in hardware memory. This lowers the 37 | memory footprint of running multiple guest operating systems. The shared 38 | pages are copy-on-write. 39 | - *Virtual network interfaces.* Disco runs a virtual subnet over which guests 40 | can communicate using standard communication protocols like NFS and TCP. 41 | Disco uses a similar copy-on-write trick as above to avoid copying data 42 | between guests. 43 | -------------------------------------------------------------------------------- /papers/burns2016borg.md: -------------------------------------------------------------------------------- 1 | ## [Borg, Omega, and Kubernetes (2016)](#borg-omega-and-kubernetes-2016) 2 | **Summary.** 3 | Google has spent the last decade developing three container management systems. 4 | *Borg* is Google's main cluster management system that manages long running 5 | production services and non-production batch jobs on the same set of machines 6 | to maximize cluster utilization. *Omega* is a clean-slate rewrite of Borg using 7 | a more principled architecture. In Omega, all system state lives in a consistent 8 | Paxos-based storage system that is accessed by a multitude of components which 9 | act as peers. *Kubernetes* is the latest open source container manager that 10 | draws on lessons from both previous systems. 11 | 12 | All three systems use containers for security and performance isolation. 13 | Container technology has evolved greatly since the inception of Borg from 14 | chroot to jails to cgroups. Of course, containers cannot prevent all forms of 15 | performance interference. Today, containers also contain program images. 16 | 17 | Containers allow the cloud to shift from a machine-oriented design to an 18 | application-oriented design, a shift that brings a number of advantages. 19 | 20 | - The gap between where an application is developed and where it is deployed is 21 | shrunk. 22 | - Application writers don't have to worry about the details of the operating 23 | system on which the application will run. 24 | - Infrastructure operators can upgrade hardware without worrying about breaking 25 | a lot of applications. 26 | - Telemetry is tied to applications rather than machines which improves 27 | introspection and debugging. 28 | 29 | Container management systems typically also provide a host of other niceties 30 | including: 31 | 32 | - naming services, 33 | - autoscaling, 34 | - load balancing, 35 | - rollout tools, and 36 | - monitoring tools. 37 | 38 | In Borg, these features were integrated over time in ad-hoc ways. Kubernetes 39 | organizes these features under a unified and flexible API. 40 | 41 | Google's experience has taught it a number of things to avoid: 42 | 43 | - Container management systems shouldn't manage ports. Kubernetes gives each 44 | job a unique IP address allowing it to use any port it wants. 45 | - Containers should have labels, not just numbers. Borg gave each task an index 46 | within its job. Kubernetes allows jobs to be labeled with key-value pairs and 47 | be grouped based on these labels. 48 | - In Borg, every task belongs to a single job. Kubernetes makes task management 49 | more flexible by allowing a task to belong to multiple groups. 50 | - Omega exposed the raw state of the system to its components.
Kubernetes 51 | avoids this by arbitrating access to state through an API. 52 | 53 | Despite the decade of experience, there are still open problems yet to be 54 | solved: 55 | 56 | - Configuration. Configuration languages begin simple but slowly evolve into 57 | complicated and poorly designed Turing complete programming languages. It's 58 | ideal to have configuration files be simple data files and let real 59 | programming languages manipulate them. 60 | - Dependency management. Programs have lots of dependencies but don't manually 61 | state them. This makes automated dependency management very tough. 62 | -------------------------------------------------------------------------------- /papers/codd1970relational.md: -------------------------------------------------------------------------------- 1 | # [A Relational Model of Data for Large Shared Data Banks (1970)](https://scholar.google.com/scholar?cluster=1624408330930846885&hl=en&as_sdt=0,5) 2 | 3 | ## Summary 4 | In this paper, Ed Codd introduces the *relational data model*. Codd begins by 5 | motivating the importance of *data independence*: the independence of the way 6 | data is queried and the way data is stored. He argues that existing database 7 | systems at the time lacked data independence; namely, the ordering of 8 | relations, the indexes on relations, and the way the data was accessed was all 9 | made explicit when the data was queried. This made it impossible for the 10 | database to evolve the way data was stored without breaking existing programs 11 | which queried the data. The relational model, on the other hand, allowed for a 12 | much greater degree of data independence. After Codd introduces the relational 13 | model, he provides an algorithm to convert a relation (which may contain other 14 | relations) into *first normal form* (i.e. relations cannot contain other 15 | relations). He then describes basic relational operators, data redundancy, and 16 | methods to check for database consistency. 17 | 18 | ## Commentary 19 | 1. Codd's advocacy for data independence and a declarative query language have 20 | stood the test of time. I particularly enjoy one excerpt from the paper 21 | where Codd says, "The universality of the data sublanguage lies in its 22 | descriptive ability (not its computing ability)". 23 | 2. Database systems at the time generally had two types of data: collections 24 | and links between those collections. The relational model represented both 25 | as relations. Today, this seems rather mundane, but I can imagine this being 26 | counterintuitive at the time. This is also yet another example of a 27 | *unifying interface* which is demonstrated in both the Unix and System R 28 | papers. 29 | 30 | 31 | -------------------------------------------------------------------------------- /papers/conway2012logic.md: -------------------------------------------------------------------------------- 1 | ## [Logic and Lattices for Distributed Programming (2012)](TODO) ## 2 | **Summary.** 3 | CRDTs provide eventual consistency without the need for coordination. However, 4 | they suffer a *scope problem*: simple CRDTs are easy to reason about and use, 5 | but more complicated CRDTs force programmers to ensure they satisfy semilattice 6 | properties. They also lack composability. Consider, for example, a Students set 7 | and a Teams set. (Alice, Bob) can be added to Teams while concurrently Bob is 8 | removed from Students. 
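A minimal sketch of that scenario (Python; it uses 2P-sets, one standard set CRDT, and hypothetical element names): each set converges after merging, yet the cross-set invariant "every team member is a student" ends up violated on both replicas.

```python
class TwoPSet:
    """A 2P-set CRDT: adds and removes are sets, merge is elementwise union."""
    def __init__(self):
        self.added, self.removed = set(), set()
    def add(self, x): self.added.add(x)
    def remove(self, x): self.removed.add(x)
    def value(self): return self.added - self.removed
    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

def fresh_replica():
    students, teams = TwoPSet(), TwoPSet()
    students.add("alice"); students.add("bob")
    return students, teams

s1, t1 = fresh_replica()          # replica 1
s2, t2 = fresh_replica()          # replica 2

t1.add(("alice", "bob"))          # replica 1: form the team (Alice, Bob)
s2.remove("bob")                  # replica 2: concurrently remove Bob

for mine, theirs in ((s1, s2), (t1, t2)):   # anti-entropy in both directions
    mine.merge(theirs); theirs.merge(mine)

print(s1.value())                 # {'alice'}
print(t1.value())                 # {('alice', 'bob')} -- Bob is on a team
                                  # but is no longer a student
```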
Each individual set may be a CRDT, but there is no 9 | mechanism to enforce consistency between the CRDTs. 10 | 11 | Bloom and CALM, on the other hand, allow for mechanized program analysis to 12 | guarantee that a program can avoid coordination. However, Bloom suffers from a 13 | *type problem*: it only operates on sets, which precludes the use of other 14 | useful structures such as integers. 15 | 16 | This paper merges CRDTs and Bloom together by introducing *bounded join 17 | semilattices* into Bloom to form a new language: Bloom^L. Bloom^L operates over 18 | semilattices, applying semilattice methods. These methods can be designated as 19 | non-monotonic, monotonic, or homomorphic (which implies monotonic). So long as 20 | the program avoids non-monotonic methods, it can be realized without 21 | coordination. Moreover, morphisms can be implemented more efficiently than 22 | non-homomorphic monotonic methods. Bloom^L comes built in with a family of 23 | useful semilattices like booleans ordered by implication, integers ordered by 24 | less than and greater than, sets, and maps. Users can also define their own 25 | semilattices, and Bloom^L allows smooth interoperability between Bloom 26 | collections and Bloom^L lattices. Bloom^L's semi-naive implementation is 27 | comparable to Bloom's semi-naive implementation. 28 | 29 | The paper also presents two case studies. First, they implement a key-value 30 | store as a map from keys to values annotated with vector clocks: a design 31 | inspired by Dynamo. They also implement a purely monotonic shopping cart 32 | using custom lattices. 33 | -------------------------------------------------------------------------------- /papers/engler1995exokernel.md: -------------------------------------------------------------------------------- 1 | ## [Exokernel: An Operating System Architecture for Application-Level Resource Management (1995)](https://scholar.google.com/scholar?cluster=4636448334605780007&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Monolithic kernels provide a large number of abstractions (e.g. processes, 4 | files, virtual memory, interprocess communication) to applications. 5 | Microkernels push some of this functionality into user space but still provide 6 | a fixed set of abstractions and services. Providing these inextensible fixed 7 | abstractions is detrimental to applications. 8 | 9 | - An abstraction cannot be best for all applications. Tradeoffs must be made 10 | which can impact performance for some applications. 11 | - A rigid set of abstractions can make it difficult for an application to layer 12 | on its own set of abstractions. For example, a user level threads package may 13 | encounter some difficulties from not having access to page faults. 14 | - Having a rigid set of abstractions means the abstractions are rarely updated. 15 | Innovative OS features are seldom integrated into real world OSes. 16 | 17 | The exokernel operating system architecture takes a different approach. It 18 | provides protected access to hardware and nothing else. All abstractions are 19 | implemented by library operating systems. The exokernel purely provides 20 | protected access to the unabstracted underlying hardware. 21 | 22 | The exokernel interface governs how library operating systems get, use, and 23 | release resources. Exokernels follow these guidelines. 24 | 25 | - *Securely expose hardware.* All the details of the hardware (e.g. privileged 26 | instructions, DMA) should be exposed to libOSes.
27 | - *Expose allocation.* LibOSes should be able to request physical resources. 28 | - *Expose physical names.* The physical names of resources (e.g. physical page 29 | 5) should be exposed. 30 | - *Expose revocation.* LibOSes should be notified when resources are revoked. 31 | 32 | Exokernels use three main techniques to ensure protected access to the 33 | underlying hardware. 34 | 35 | 1. *Secure bindings.* A secure binding decouples authorization from use and is 36 | best explained through an example. A libOS can request that a certain entry 37 | be inserted into a TLB. The exokernel can check that the entry is valid. 38 | This is the authorization. Later, the CPU can use the TLB without any 39 | checking. This is use. The TLB entry can be used multiple times after being 40 | authorized only once. 41 | 42 | There are three forms of secure bindings. First are hardware mechanisms like 43 | TLB entries or screens in which each pixel is tagged with a process. 44 | Second are software mechanisms like TLB caching or packet filters. Third is 45 | downloading and executing code using type-safe languages, interpretation, or 46 | sandboxing. Exokernels can download Application-Specific Safe Handlers 47 | (ASHes). 48 | 2. *Visible resource revocation.* In traditional operating systems, resource 49 | revocation is made invisible to applications. For example, when an 50 | application's page is swapped to disk, it is not notified. The exokernel 51 | makes resource revocation visible by notifying the libOS. For example, each 52 | libOS is notified when its quantum is over. This allows it to do things like 53 | only store the registers it needs. 54 | 3. *Abort protocol.* If a libOS is misbehaving and not responding to revocation 55 | requests, the exokernel can forcibly remove allocations. Naively, it could 56 | kill the libOS. Less naively, it can simply remove all secure bindings. 57 | 58 | The paper also presents the Aegis exokernel and the ExOS library operating 59 | system. 60 | -------------------------------------------------------------------------------- /papers/golub1992microkernel.md: -------------------------------------------------------------------------------- 1 | # [Microkernel Operating System Architecture and Mach (1992)](https://scholar.google.com/scholar?cluster=1074648542567860981) 2 | A **microkernel** is a very minimal, very low-level piece of code that 3 | interfaces with hardware to implement the functionality needed for an operating 4 | system. Operating systems implemented using a microkernel architecture, rather 5 | than a monolithic kernel architecture, implement most of the operating system 6 | in user space on top of the microkernel.
This architecture affords many 7 | advantages including: 8 | 9 | - **tailorability**: many operating systems can be run on the same microkernel 10 | - **portability**: most hardware-specific code is in the microkernel 11 | - **network accessibility**: operating system services can be provided over the 12 | network 13 | - **extensibility**: new operating system environments can be tested along side 14 | existing ones 15 | - **real-time**: the kernel does not hold locks for very long 16 | - **multiprocessor support**: microkernel operations can be parallelized across 17 | processors 18 | - **multicomputer support**: microkernel operations can be parallelized across 19 | computers 20 | - **security**: a microkernel is a small trusted computing base 21 | 22 | This paper describes various ways in which operating systems can be implemented 23 | on top of the Mach microkernel. Mach's key features include: 24 | 25 | - **task and thread management**: Mach supports tasks (i.e. processes) and 26 | threads. Mach implements a thread scheduler, but privileged user level 27 | programs can alter the thread scheduling algorithms. 28 | - **interprocess communication**: Mach implements a capabilities based IPC 29 | mechanism known as ports. Every object in Mach (e.g. tasks, threads, memory) 30 | is managed by sending message to its corresponding port. 31 | - **memory object management**: Memory is represented as memory objects managed 32 | via ports. 33 | - **system call redirection**: Mach allows a number of system calls to be 34 | caught and handled by user level code. 35 | - **device support**: Devices are represented as ports. 36 | - **user multiprocessing**: Tasks can use a user-level threading library. 37 | - **multicomputer support**: Mach abstractions can be transparently implemented 38 | on distributed hardware. 39 | - **Continuations**: In a typical operating system, when a thread blocks, all 40 | of its registers are stored somewhere before another piece of code starts to 41 | run. Its stack is also left intact. When the blocking thread is resumed, its 42 | stored registers are put back in place and the thread starts running again. 43 | This can be wasteful if the thread doesn't need all of the registers or its 44 | stack. In Mach, threads can block with a continuation: an address and a bunch 45 | of state. This can be more efficient since the thread only saves what it 46 | needs to keep executing. 47 | 48 | Many different operating systems can be built on top of Mach. It's ideal that 49 | applications built for these operating systems can continue to run unmodified 50 | even when the underlying OS is implemented on top of Mach. A key part of this 51 | virtualization is something called an **emulation library**. An emulation 52 | library is a piece of code inserted into an application's address space. When a 53 | program issues system call, Mach immediately redirects control flow to the 54 | emulation library to process it. The emulation library can then handle the 55 | system call by, for example, issuing an RPC to an operating system server. 56 | 57 | Operating systems built on Mach can be architected in one of three ways: 58 | 59 | 1. The entire operating system can be baked into the emulation library. 60 | 2. The operating system can be shoved into a single multithreaded Mach task. 61 | This architecture can be very memory efficient, and is easy to implement 62 | since the guest operating system can be ported with very little code 63 | modification. 64 | 3. 
The operating system can be decomposed into a larger number of smaller 65 | processes that communicate with one another using IPC. This architecture 66 | encourages modularity, and allows certain operating system components to be 67 | reused between operating systems. This approach can lead to inefficiency, 68 | especially if IPC is not lightning fast! 69 | -------------------------------------------------------------------------------- /papers/graefe1990encapsulation.md: -------------------------------------------------------------------------------- 1 | # [Encapsulation of Parallelism in the Volcano Query Processing System (1990)](https://scholar.google.com/scholar?cluster=12203514891599153151) 2 | The Volcano query processing system uses the operator model of query execution 3 | that we all know and love. In this paper, Graefe discusses the **exchange 4 | operator**: an operator that you can insert into a query plan which 5 | automatically parallelizes the execution of the query plan. The exchange 6 | operator allows queries to be executed in parallel without having to rewrite 7 | existing operators. 8 | 9 | ## The Template Approach 10 | Parallel databases like Gamma and Bubba used a **template approach** to query 11 | parallelization. With the template approach, algorithms are plugged into 12 | templates which abstract away the details of where tuples come from and where 13 | they are sent to. They also abstract away the mechanisms used to send a tuple 14 | over the network or to another process using some IPC mechanism. 15 | Unfortunately, the template approach *required* that tuples either be sent over 16 | the network or over IPC which greatly reduced its performance. 17 | 18 | ## Volcano Design 19 | Volcano implements queries as a tree of iterators with an open-next-close 20 | interface. The paper discusses something called state records which allow the 21 | same operator to be used multiple times with different parameters, though this seems 22 | obsolete now that we have modern programming languages. Iterators return 23 | structures which contain record ids and pointers to the records which are 24 | pinned in the buffer pool. 25 | 26 | ## The Operator Model of Parallelization 27 | In the operator model of parallelization, exchange operators are placed in 28 | a query plan to automatically parallelize its execution. Exchange operators 29 | provide three forms of parallelism. 30 | 31 | 1. **Pipeline parallelism.** The exchange operator allows its child and parent 32 | to run in different processes. When opened, the exchange operator forks a 33 | child and establishes a region of shared memory called a **port** which the 34 | parent and child use to communicate with one another. The child part of the 35 | exchange operator (known as the **producer**) continuously calls next on its 36 | child, buffers the records into batches called packets, and writes them into 37 | the port (with appropriate synchronization). The parent part of the exchange 38 | operator (known as the **consumer**) reads a record from a packet 39 | in the port whenever its next method is invoked. For flow control, the 40 | producer decrements a semaphore and the consumer increments a 41 | semaphore. 42 | 2. **Bushy parallelism.** Bushy parallelism is when multiple siblings in a query 43 | plan are executed in parallel. Bushy parallelism is really just pipeline 44 | parallelism applied to multiple siblings. 45 | 3.
**Intra-operator parallelism.** Intra-operator parallelism allows the same 46 | operator to run in parallel on multiple partitions of input data. As before, 47 | the exchange operator forks a (master) producer and establishes a port. This 48 | master producer forks other producers and gives them the location of the 49 | port. Each forked producer might eventually invoke another group of exchange 50 | operators. The master consumer will fork a master producer and create a 51 | port. Producers can send records to consumers using round-robin, hash, or 52 | range partitioning. When a node is out of records, it sends an 53 | end-of-stream message to all consumers. Consumers end their stream when they 54 | count the appropriate number of end-of-stream messages. Unlike the pipelined 55 | parallelism version of the exchange operator, it is 56 | not so obvious how to implement the intra-operator version. Moreover, I think it involves modifications 57 | to the leaves of a query plan. 58 | 59 | ## Variants of Exchange 60 | Producers can also broadcast tuples to consumers. A merge operator sits on top 61 | of an exchange operator and merges together the sorted streams produced by each 62 | producer. The paper also discusses sorting files and having exchange operators 63 | in the middle of a query fragment, but it's hard to understand. 64 | -------------------------------------------------------------------------------- /papers/halevy2016goods.md: -------------------------------------------------------------------------------- 1 | ## [Goods: Organizing Google's Datasets (2016)](TODO) ## 2 | **Summary.** 3 | For fear of fettering development and innovation, companies often allow 4 | engineers free rein to generate and analyze datasets at will. This often leads 5 | to unorganized data lakes: a ragtag collection of datasets from a diverse set 6 | of sources. Google Dataset Search (Goods) is a system which uses unobtrusive 7 | post-hoc metadata extraction and inference to organize Google's unorganized 8 | datasets and present curated dataset information, such as metadata and 9 | provenance, to engineers. 10 | 11 | Building a system like Goods at Google scale presents many challenges. 12 | 13 | - *Scale.* There are 26 billion datasets. *26 billion* (with a b)! 14 | - *Variety.* Data comes from a diverse set of sources (e.g. BigTable, Spanner, 15 | logs). 16 | - *Churn.* Roughly 5% of the datasets are deleted every day, and datasets are 17 | created roughly as quickly as they are deleted. 18 | - *Uncertainty.* Some metadata inference is approximate and speculative. 19 | - *Ranking.* To facilitate useful dataset search, datasets have to be ranked by 20 | importance: a difficult heuristic-driven process. 21 | - *Semantics.* Extracting the semantic content of a dataset is useful but 22 | challenging. For example, consider a file of protos that doesn't reference the 23 | type of proto being stored. 24 | 25 | The Goods catalog is a BigTable keyed by dataset name where each row contains 26 | metadata including 27 | 28 | - *basic metadata* like timestamp, owners, and access permissions; 29 | - *provenance* showing the lineage of each dataset; 30 | - *schema*; 31 | - *data summaries* extracted from source code; and 32 | - *user provided annotations*. 33 | 34 | Moreover, similar datasets or multiple versions of the same logical dataset are 35 | grouped together to form *clusters*.
Metadata for one element of a cluster can 36 | be used as metadata for other elements of the cluster, greatly reducing the 37 | amount of metadata that needs to be computed. Data is clustered by timestamp, 38 | data center, machine, version, and UID, all of which is extracted from dataset 39 | paths (e.g. `/foo/bar/montana/August01/foo.txt`). 40 | 41 | In addition to storing dataset metadata, each row also stores *status 42 | metadata*: information about the completion status of various jobs which 43 | operate on the catalog. The numerous concurrently executing batch jobs use 44 | *status metadata* as a weak form of synchronization and dependency resolution, 45 | potentially deferring the processing of a row until another job has processed 46 | it. 47 | 48 | The fault tolerance of these jobs is provided by a mix of job retries, 49 | BigTable's idempotent update semantics, and a watchdog that terminates 50 | divergent programs. 51 | 52 | Finally, a two-phase garbage collector tombstones rows that satisfy a garbage 53 | collection predicate and removes them one day later if they still match the 54 | predicate. Batch jobs do not process tombstoned rows. 55 | 56 | The Goods frontend includes dataset profile pages, dataset search driven by a 57 | handful of heuristics to rank datasets by importance, and teams dashboard. 58 | -------------------------------------------------------------------------------- /papers/hellerstein2010declarative.md: -------------------------------------------------------------------------------- 1 | ## [The Declarative Imperative: Experiences and Conjectures in Distributed Logic (2010)](https://scholar.google.com/scholar?cluster=1374149560926608837&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | With (a strict interpretation of) Moore's Law in decline and an overabundance 4 | of compute resources in the cloud, performance necessitates parallelism. The 5 | rub? Parallel programming is difficult. Contemporaneously, Datalog (and 6 | variants) have proven effective in an increasing number of domains from 7 | networking to distributed systems. Better yet, declarative logic programming 8 | allows for programs to be built with orders of magnitude less code and permits 9 | formal reasoning. In this invited paper, Joe discusses his experiences with 10 | distributed declarative programming and conjectures some deep connections 11 | between logic and distributed systems. 12 | 13 | Joe's group has explored many distributed Datalog variants, the latest of which 14 | is *Dedalus*. Dedalus includes notions of time: both intra-node atomicity and 15 | sequencing and inter-node causality. Every table in Dedalus includes a 16 | timestamp in its rightmost column. Dedalus' rules are characterized by how they 17 | interact with these timestamps: 18 | 19 | - *Deductive rules* involve a single timestamp. They are traditional Datalog 20 | stamements. 21 | - *Inductive rules* involve a head with a timestamp one larger than the 22 | timestamps of the body. They represent the creation of facts from one point 23 | in time to the next point in time. 24 | - *Asynchronous rules* involve a head with a non-deterministically chosen 25 | timestamp. These rules capture the notion of non-deterministic message 26 | delivery. 27 | 28 | Joe's group's experience with distributed logic programming lead to the 29 | following conclusions: 30 | 31 | - Datalog can very concisely express distributed algorithms that involved 32 | recursive computations of transitive closures like web crawling. 
33 | - Annotating relations with a location specifier column allows tables to be 34 | transparently partitions and allows for declarative communication: a form of 35 | "network data independence". This could permit many networking optimizations. 36 | - Stratifying programs based on timesteps introduces a notion of 37 | transactionality. Every operation taking place in the same timestamp occurs 38 | atomically. 39 | - Making all tables ephemeral and persisting data via explicit inductive rules 40 | naturally allows transience in things like soft-state caches without 41 | precluding persistence. 42 | - Treating events as a streaming join of inputs with persisted data is an 43 | alternative to threaded or event-looped parallel programming. 44 | - Monotonic programs parallelize embarrassingly well. Non-monotonicity requires 45 | coordination and coordination requires non-monotonicity. 46 | - Logic programming has its disadvantages. There is code redundancy; lack of 47 | scope, encapsulation, and modularity; and providing consistent distributed 48 | relations is difficult. 49 | 50 | The experience also leads to a number of conjectures: 51 | 52 | - The CALM conjecture stats that programs that can be expressed in monotonic 53 | Datalog are exactly the programs that can be implemented with 54 | coordination-free eventual consistency. 55 | - Dedalus' asynchronous rules allow for an infinite number of traces. Perhaps, 56 | all these traces are confluent and result in the same final state. Or perhaps 57 | they are all equivalent for some notion of equivalence. 58 | - The CRON conjecture states that messages sent to the past lead only to 59 | paradoxes if the message has non-monotonic implications. 60 | - If computation today is so cheap, then the real computation cost comes from 61 | coordination between strata. Thus, the minimum number of Dedalus timestamps 62 | required to implement a program represents its minimum *coordination 63 | complexity*. 64 | - To further decrease latency, programs can be computed approximately either by 65 | providing probabilistic bounds on their outputs or speculatively executing 66 | and fixing results in a post-hoc manner. 67 | -------------------------------------------------------------------------------- /papers/hellerstein2012madlib.md: -------------------------------------------------------------------------------- 1 | # [The MADlib Analytics Library or MAD Skills, the SQL (2012)](https://scholar.google.com/scholar?cluster=2154261383124050736) 2 | MADlib is a library of machine learning and statistics functions that 3 | integrates into a relational database. For example, you can store labelled 4 | training data in a relational database and run logistic regression over it like 5 | this: 6 | 7 | ```sql 8 | SELECT madlib.logregr_train( 9 |   'patients',                           -- source table 10 |   'patients_logregr',                   -- output table 11 |   'second_attack',                      -- labels 12 |   'ARRAY[1, treatment, trait_anxiety]', -- features 13 |   NULL,                                 -- grouping columns 14 |   20,                                   -- max number of iterations 15 |   'irls'                                -- optimizer 16 | ); 17 | ``` 18 | 19 | MADlib programming is divided into two conceptual types of programming: 20 | macro-programming and micro-programming. **Macro-programming** deals with 21 | partitioning matrices across nodes, moving matrix partitions, and operating on 22 | matrices in parallel. 
**Micro-programming** deals with writing efficient code 23 | which operates on a single chunk of a matrix on one node. 24 | 25 | ## Macro-Programming 26 | MADlib leverages user-defined aggregates to operate on matrices in parallel. A 27 | user defined-aggregate over a set of type `T` comes in three pieces. 28 | 29 | - A **transition function** of type `A -> T -> A` folds over the set. 30 | - A **merge function** of type `A -> A -> A` merges intermediate aggregates. 31 | - A **final function** `A -> B` translates the final aggregate. 32 | 33 | Standard user-defined aggregates aren't sufficient to express a lot of machine 34 | learning algorithms. They suffer two main problems: 35 | 36 | 1. User-defined aggregates cannot easily iterate over the same data multiple 37 | times. Some solutions involve counted iteration by joining with virtual 38 | tables, window aggregates, and recursion. MADlib elects to express iteration 39 | using small bits of Python driver code that stores state between iterations 40 | in temporary tables. 41 | 2. User-defined aggregates are not polymorphic. Each aggregate must explicitly 42 | declare the type of its input, but some aggregates are pretty generic. 43 | MADlib uses Python UDFs which generate SQL based on input type. 44 | 45 | ## Micro-Programming 46 | MADlib user-defined code calls into fast linear algebra libraries (e.g. Eigen) 47 | for dense linear algebra. MADlib implements its own sparse linear algebra 48 | library in C. MADlib also provides a C++ abstraction for writing low-level 49 | linear algebra code. Notably, it translates C++ types into database types and 50 | integrates nicely with libraries like Eigen. 51 | 52 | ## Examples 53 | Least squares regression can be computed in a single pass of the data. Logistic 54 | regression and k-means clustering require a Python driver to manage multiple 55 | iterations. 56 | 57 | ## University Research 58 | Wisconsin implemented stochastic gradient descent in MADlib. Berkeley and 59 | Florida implemented some statistic text analytics features including text 60 | feature expansion, approximate string matching, Viterbi inference, and MCMC 61 | inference (though I don't know what any of these are). 62 | 63 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /papers/hindman2011mesos.md: -------------------------------------------------------------------------------- 1 | ## [Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (2011)](https://scholar.google.com/scholar?cluster=816726489244916508&hl=en&as_sdt=0,5) 2 | See [`https://github.com/mwhittaker/mesos_talk`](https://github.com/mwhittaker/mesos_talk). 3 | 4 | -------------------------------------------------------------------------------- /papers/kohler2000click.md: -------------------------------------------------------------------------------- 1 | ## [The Click Modular Router (2000)](https://goo.gl/t1AlsN) ## 2 | **Summary.** 3 | Routers are more than routers. They are also firewalls, load balancers, address 4 | translators, etc. Unfortunately, implementing these additional router 5 | responsibilities is onerous; most routers are closed platforms with inflexible 6 | designs. The Click router architecture, on the other hand, permits the creation 7 | of highly modular and flexible routers with reasonable performance. Click is 8 | implemented in C++ and compiles router specifications to routers which run on 9 | general purpose machines. 
10 | 11 | Much like how Unix embraces modularity by composing simple *programs* using 12 | *pipes*, the Click architecture organizes a network of *elements* connected by 13 | *connections*. Each element represents an atomic unit of computation (e.g. 14 | counting packets) and is implemented as a C++ object that points to the other 15 | elements to which it is connected. Each element has input and output ports, can 16 | be provided arguments as configuration strings, and can expose arbitrary 17 | methods to other elements and to users. 18 | 19 | Connections and ports are either *push-based* or *pull-based*. The source 20 | element of a push connection pushes data to the destination element. For 21 | example, when a network device receives a packet, it pushes it to other 22 | elements. Dually, the destination element of a pull connection can request to 23 | pull data from the source element or receive null if no such data is available. 24 | For example, when a network device is ready to be written to, it may pull data 25 | from other elements. Ports can be designated as pull, push, or agnostic. Pull 26 | ports must be matched with other pull ports, push ports must be matched with 27 | other push ports, and agnostic ports are labeled as pull or push ports during 28 | router initialization. *Queues* are packet storing elements with a push input 29 | port and pull output port; they form the interface between push and pull 30 | connections. 31 | 32 | Some elements can process packets with purely local information. Other elements 33 | require knowledge of other elements. For example, a packet dropping element 34 | placed before a queue might integrate the length of the queue in its packet 35 | dropping policy. As a compromise between purely local information and complete 36 | global information, Click provides *flow-based router context* to elements 37 | allowing them to answer queries such as "what queue would a packet reach if I 38 | sent it out of my second port?". 39 | 40 | Click routers are specified using a simple declarative configuration language. 41 | The language allows users to group elements into *compound elements* that 42 | behave as a single element. 43 | 44 | The Click kernel driver is a single kernel thread that processes elements on a 45 | task queue. When an element is processed, it may in turn push data to or pull 46 | data from other elements forcing them to be processed. To avoid interrupt and 47 | device management overheads, the driver uses polling. Every input and output 48 | device polls for data whenever it is run by the driver. Router configurations 49 | are loaded and element methods are called via the `/proc` file system. The 50 | driver also supports hot-swapping in new router configurations which take over 51 | the state of the previous router. 52 | 53 | The authors implement a fully compliant IP router using Click and explore 54 | various extensions to it including scheduling and dropping packets. The 55 | performance of the IP router is measured and analyzed. 56 | -------------------------------------------------------------------------------- /papers/kornacker2015impala.md: -------------------------------------------------------------------------------- 1 | ## [Impala: A Modern, Open-Source SQL Engine for Hadoop (2015)](https://scholar.google.com/scholar?cluster=14277865292469814912&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Impala is a distributed query engine built on top of Hadoop. 
That is, it builds 4 | off of existing Hadoop tools and frameworks and reads data stored in Hadoop 5 | file formats from HDFS. 6 | 7 | Impala's `CREATE TABLE` commands specify the location and file format of data 8 | stored in Hadoop. This data can also be partitioned into different HDFS 9 | directories based on certain column values. Users can then issue typical SQL 10 | queries against the data. Impala supports batch INSERTs but doesn't support 11 | UPDATE or DELETE. Data can also be manipulated directly by going through HDFS. 12 | 13 | Impala is divided into three components. 14 | 15 | 1. An Impala daemon (impalad) runs on each machine and is responsible for 16 | receiving queries from users and for orchestrating the execution of queries. 17 | 2. A single Statestore daemon (statestored) is a pub/sub system used to 18 | disseminate system metadata asynchronously to clients. The statestore has 19 | weak semantics and doesn't persist anything to disk. 20 | 3. A single Catalog daemon (catalogd) publishes catalog information through the 21 | statestored. The catalogd pulls in metadata from external systems, puts it 22 | in Impala form, and pushes it through the statestored. 23 | 24 | Impala has a Java frontend that performs the typical database frontend 25 | operations (e.g. parsing, semantic analysis, and query optimization). It uses a 26 | two-phase query planner. 27 | 28 | 1. *Single node planning.* First, a single-node non-executable query plan tree 29 | is formed. Typical optimizations like join reordering are performed. 30 | 2. *Plan parallelization.* After a single node plan is formed, it is fragmented 31 | and divided between multiple nodes with the goal of minimizing data movement 32 | and maximizing scan locality. 33 | 34 | Impala has a C++ backend that uses Volcano-style iterators with exchange 35 | operators and runtime code generation using LLVM. To efficiently read data from 36 | disk, Impala bypasses the traditional HDFS protocols. The backend supports a 37 | lot of different file formats including Avro, RC, sequence, plain text, and 38 | Parquet. 39 | 40 | For cluster and resource management, Impala uses a home-grown system called Llama that 41 | sits on top of YARN. 42 | 43 | -------------------------------------------------------------------------------- /papers/lagar2009snowflock.md: -------------------------------------------------------------------------------- 1 | ## [SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing (2009)](https://scholar.google.com/scholar?cluster=3030124086251534312&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Public clouds like Amazon's EC2 or Google's Compute Engine allow users to 4 | elastically spawn a huge number of virtual machines on a huge number of physical 5 | machines. However, spawning a VM can take on the order of minutes, and 6 | typically spawned VMs are launched in some static initial state. SnowFlock 7 | implements the VM fork abstraction, in which a parent VM forks a set of 8 | children VMs all of which inherit a snapshot of the parent. Moreover, SnowFlock 9 | implements this abstraction with subsecond latency. A subsecond VM fork 10 | implementation can be used for sandboxing, parallel computation (the focus of 11 | this paper), load handling, etc. 12 | 13 | SnowFlock is built on top of Xen. Specifically, it is a combination of 14 | modifications to the Xen hypervisor and a set of daemons running in dom0 which 15 | together form a distributed system that manages virtual machine forking. 16 | Guests use a set of calls (i.e.
`sf_request_ticket`, `sf_clone`, `sf_join`, 17 | `sf_kill`, and `sf_exit`) to request resources on other machines, fork 18 | children, wait for children, kill children, and exit from a child. This implies 19 | that applications must be modified. SnowFlock implements the forking mechanism 20 | and leaves policy to pluggable cluster framework management software. 21 | 22 | SnowFlock takes advantage of four insights with four distinct implementation 23 | details. 24 | 25 | 1. VMs can get very large, on the order of a couple of GBs. Copying these 26 | images between physical machines can saturate the network, and even when 27 | implemented using things like multicast, can still be slow. Thus, SnowFlock 28 | must reduce the amount of state transfered between machines. SnowFlock takes 29 | advantage of the fact that a newly forked VM doesn't need the entire VM 30 | image to start running. Instead, SnowFlock uses *VM Descriptors*: a 31 | condensed VM image that consists of VM metadata, a few memory pages, 32 | registers, a global descriptor table, and page tables. When a VM is forked, 33 | a VM descriptor for it is formed and sent to the children to begin running. 34 | 2. When a VM is created from a VM descriptor, it doesn't have all the memory it 35 | needs to continue executing. Thus, memory must be sent from the parent when 36 | it is first accessed. Parent VMs use copy-on-write to maintain an immutable 37 | copy of memory at the point of the snapshot. Children use Xen shadow pages 38 | to trap accesses to pages not yet present and request them from the parent 39 | VM. 40 | 3. Sometimes VMs access memory that they don't really need to get from the 41 | parent. SnowFlock uses two *avoidance heuristics* to avoid the transfer 42 | overhead. First, if new memory is being allocated (often indirectly through 43 | a call to something like malloc), the memory contents are not important and 44 | do not need to be paged in from the parent. A similar optimization is made 45 | for buffers being written to by devices. 46 | 4. Finally, parent and children VMs often access the same code and data. 47 | SnowFlock takes advantage of this data locality by prefetching; when one 48 | child VM requests a page of memory, the parent multicasts it to all 49 | children. 50 | 51 | Furthermore, the same copy-on-write techniques to maintain an immutable 52 | snapshot of memory are used on the disk. And, the parent and children virtual 53 | machines are connected by a virtual subnet in which each child is given an IP 54 | address based on its unique id. 55 | -------------------------------------------------------------------------------- /papers/lehman1999t.md: -------------------------------------------------------------------------------- 1 | ## [T Spaces: The Next Wave (1999)](https://goo.gl/mxIv4g) ## 2 | **Summary.** 3 | T Spaces is a 4 | 5 | > tuplespace-based network communication buffer with database capabilities that 6 | > enables communication between applications and devices in a network of 7 | > heterogeneous computers and operating systems 8 | 9 | Essentially, it's Linda++; it implements a Linda tuplespace with a couple new 10 | operators and transactions. 11 | 12 | The paper begins with a history of related tuplespace based work. The notion of 13 | a shared collaborative space originated from AI *blackboard systems* popular in 14 | the 1970's, the most famous of which was the Hearsay-II system. 
Later, the 15 | Stony Brook microcomputer Network (SBN), a cluster organized in a torus 16 | topology, was developed at Stony Brook, and Linda was invented to program it. 17 | Over time, the domain in which tuplespaces were popular shifted from parallel 18 | programming to distributed programming, and a huge number of Linda-like systems 19 | were implemented. 20 | 21 | T Spaces is the marriage of tuplespaces, databases, and Java. 22 | 23 | - *Tuplespaces* provide a flexible communication model; 24 | - *databases* provide stability, durability, and advanced querying; and 25 | - *Java* provides portability and flexibility. 26 | 27 | T Spaces implements a Linda tuplespace with a few improvements: 28 | 29 | - In addition to the traditional `Read`, `Write`, `Take`, `WaitToRead`, and 30 | `WaitToTake` operators, T Spaces also introduces a `Scan`/`ConsumingScan` 31 | operator to read/take all tuples matched by a query and a `Rhonda` operator 32 | to exchange tuples between processes. 33 | - Users can also dynamically register new operators, the implementation of 34 | which takes advantage of Java. 35 | - Fields of tuples are indexed by name, and tuples can be queried by named 36 | value. For example, the query `(foo = 8)` returns *all* tuples (of any type) 37 | with a field `foo` equal to 8. These indexes are similar to the inversions 38 | implemented in Phase 0 of System R. 39 | - Range queries are supported. 40 | - To avoid storing large values inside of tuples, file URLs can instead be 41 | stored, and T Spaces transparently handles locating and transferring the file 42 | contents. 43 | - T Spaces implements a group-based ACL form of authorization. 44 | - T Spaces supports transactions. 45 | 46 | To evaluate the expressiveness and performance of T Spaces, the authors 47 | implement a collaborative web-crawling application, a web-search information 48 | delivery system, and a universal information appliance. 49 | -------------------------------------------------------------------------------- /papers/letia2009crdts.md: -------------------------------------------------------------------------------- 1 | ## [CRDTs: Consistency without concurrency control (2009)](https://scholar.google.com/scholar?cluster=9773072957814807258&hl=en&as_sdt=0,5) 2 | **Overview.** 3 | Concurrently updating distributed mutable data is a challenging problem that 4 | often requires expensive coordination. *Commutative replicated data types* 5 | (CRDTs) are data types with commutative update operators that can provide 6 | convergence without coordination. Moreover, non-trivial CRDTs exist; this paper 7 | presents Treedoc: an ordered set CRDT. 8 | 9 | **Ordered Set CRDT.** 10 | An ordered set CRDT represents an ordered sequence of atoms. Atoms are 11 | associated with IDs with five properties: 12 | 1. Two replicas of the same atom have the same ID. 13 | 2. No two atoms have the same ID. 14 | 3. IDs are constant for the lifetime of an ordered set. 15 | 4. IDs are totally ordered. 16 | 5. The space of IDs is dense. That is, for all IDs P and F where P < F, there 17 | exists an ID N such that P < N < F. 18 | 19 | The ordered set supports two operations: 20 | 21 | 1. insert(ID, atom) 22 | 2. delete(ID) 23 | 24 | where atoms are ordered by their corresponding ID. Concretely, Treedoc is 25 | represented as a tree, and IDs are paths in the tree ordered by an infix 26 | traversal.
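A minimal sketch of such IDs (Python; it ignores the mini-nodes and disambiguators described next, and the `between` rule below is just one simple way to satisfy density): an ID is a path of 0/1 branch choices, infix order falls out of padding each path with a sentinel that sorts between 0 and 1, and a fresh ID can always be minted between two existing ones.

```python
def infix_key(path):
    # A node sorts after its left subtree (0 bits) and before its right
    # subtree (1 bits), so pad the path with a sentinel halfway in between.
    return list(path) + [0.5]

def between(p, q):
    """Mint an ID strictly between p and q (requires p < q in infix order)."""
    assert infix_key(p) < infix_key(q)
    if list(q[:len(p)]) == list(p):   # q is below p: step left underneath q
        return tuple(q) + (0,)
    return tuple(p) + (1,)            # otherwise: step right underneath p

root, left = (), (0,)                 # the root and its left child
mid = between(left, root)             # (0, 1): the left child's right child
assert infix_key(left) < infix_key(mid) < infix_key(root)
```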
Nodes in the tree can be *major nodes* and contain multiple 27 | *mini-nodes* where each mini-node is annotated with a totally ordered 28 | *disambiguator* unique to each node. 29 | 30 | Deletes simply tombstone a node. Inserts work like a typical binary tree 31 | insertion. To avoid the tree and IDs from getting too large, the tree can 32 | periodically be *flattened*: the tree is restructured into an array of nodes. 33 | This operation does not commute nicely with others, so a coordination protocol 34 | like 2PC is required. 35 | 36 | **Treedoc in the Large Scale.** 37 | Requiring a coordination protocol for flattening doesn't scale and runs against 38 | the spirit of CRDTs. It also doesn't handle churn well. Instead, nodes can be 39 | partitioned into a set of core nodes and a set of nodes in the nebula. The core 40 | nodes coordinate while the nebula nodes lag behind. There are protocols for 41 | nebula nodes to catch up to the core nodes. 42 | -------------------------------------------------------------------------------- /papers/li2014automating.md: -------------------------------------------------------------------------------- 1 | ## [Automating the Choice of Consistency Levels in Replicated Systems (2014)](https://scholar.google.com/scholar?cluster=5894108460168532172&hl=en&as_sdt=0,5) 2 | Geo-replicated data stores that replicate data have to choose between incurring 3 | the performance overheads of implementing strong consistency or the brain 4 | boggling semantics of weak consistency. Some data stores allow users to make 5 | this decision on a fine grained level, allowing some operations to operate with 6 | strong consistency while other operations operate under weak consistency. For 7 | example, [*Making Geo-Replicated Systems Fast as Possible, Consistent when 8 | Necessary*](https://scholar.google.com/scholar?cluster=4316742817395056095&hl=en&as_sdt=0,5) 9 | introduced RedBlue consistency in which users annotate operations as red 10 | (strong) or blue (weak). Typically, the process of choosing the consistency 11 | level for each operation has been a manual task. This paper builds off of 12 | RedBlue consistency and automates the process of labeling operations as red or 13 | blue. 14 | 15 | **Overview.** 16 | When using RedBlue consistency, users can decompose operations into generator 17 | operations and shadow operations to improve commutativity. This paper automates 18 | this process by converting Java applications that write state to a relational 19 | database to issue operations against CRDTs. Moreover, it uses a combination of 20 | static analysis and runtime checks to ensure that operations are invariant 21 | confluent. 22 | 23 | **Generating Shadow Operations.** 24 | Relations are just sets of tuples, so we can model them as CRDT sets. Moreover, 25 | we can model individual fields as CRDTs. This paper presents field CRDTs 26 | (PN-Counters, LWW-registers) and set CRDTs to model relations. Users then 27 | annotate SQL create statements indicating which CRDT to use. 28 | 29 | Using a custom JDBC driver, user operations can be converted into a 30 | corresponding shadow operation that issues CRDT operations. 31 | 32 | **Classification of Shadow Operations.** 33 | We want to find the database states and transaction arguments that guarantee 34 | invariant-confluence. This can be a very difficult (probably undecidable) 35 | problem in general. Instead, we simplify the problem by considering the 36 | possible trace of CRDT operations through a program. 
Even with this simplifying 37 | assumption, tracing execution through loops can be challenging. We can again 38 | simplify things by only considering loop where each iteration is independent of 39 | one another. 40 | 41 | We model each program as a regular expression over statements. We then unroll 42 | the regular expression to get the set of all execution traces. Using these 43 | traces we can construct a map from traces, or templates, to weakest 44 | preconditions that ensure invariant confluence. 45 | 46 | At runtime, we need only look up the weakest precondition given an execution 47 | trace and check that it is true. 48 | 49 | **Implementation.** 50 | The system is implemented as 15K lines of Java and 533 lines of OCaml. It 51 | builds off of MySQL, an existing Java parser, and Gemini. 52 | 53 | -------------------------------------------------------------------------------- /papers/liu1973scheduling.md: -------------------------------------------------------------------------------- 1 | # [Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment (1973)](https://scholar.google.com/scholar?cluster=11972780054098474552&hl=en&as_sdt=0,5) 2 | Consider a hard-real-time environment in which tasks *must* finish within some 3 | time after they are requested. We make the following assumptions. 4 | 5 | - (A1) Tasks are periodic with fixed periods. 6 | - (A2) Tasks must finish before they are next requested. 7 | - (A3) Tasks are independent. 8 | - (A4) Tasks have constant runtime. 9 | - (A5) Non-periodic tasks are not realtime. 10 | 11 | Thus, we can model each task $t_i$ as a period $T_i$ and runtime $C_i$. A 12 | scheduling algorithm that immediately preempts tasks to guarantee that the task 13 | with the highest priority is running is called a **preemptive priority 14 | scheduling algorithm**. We consider three preemptive priority scheduling 15 | algorithms: a static/fixed priority scheduler (in which priorities are assigned 16 | ahead of time), a dynamic priority scheduler (in which priorities are assigned 17 | at runtime), and a mixed scheduling algorithm. 18 | 19 | ## Fixed Priority Scheduling Algorithm 20 | First, a few definitions: 21 | 22 | - The **deadline** of a task is the time at which the next request is issued. 23 | - An **overflow** occurs at time $t$ if $t$ is the deadline for an unfulfilled 24 | task. 25 | - A schedule is **feasible** if there is no overflow. 26 | - The response time of a task is the time between the task's request and the 27 | task's finish time. 28 | - A **critical instant** for task $t$ is the instant where $t$ has the highest 29 | response time. 30 | 31 | It can be shown that the critical instant for any task occurs when the task is 32 | requested simultaneously with all higher priority tasks. This result lets us 33 | easily determine if a feasible fixed priority schedule exists by 34 | pessimistically assuming all tasks are scheduled at their critical instant. 35 | 36 | It also suggests that given two tasks with periodicities $T_1$ and $T_2$ where 37 | $T_1 < T_2$, we should give higher priority to the shorter task with period 38 | $T_1$. This leads to the **rate-monotonic priority scheduling algorithm** 39 | where we assign higher priorities to shorter tasks. A feasible static schedule 40 | exists if and only if a feasible rate-monotonic scheduling algorithm exists. 41 | 42 | Define **processor utilization** to be the fraction of time the processor 43 | spends running tasks. 
We say a set of tasks **fully utilize** the processor if 44 | there exists a feasible schedule for them, but increasing the running time of 45 | any of the tasks implies there is no feasible schedule. The least upper bound 46 | on processor utilization is the minimum processor utilization for tasks that 47 | fully utilize the processor. For $m$ tasks, the least upper bound is $m(2^{1/m} 48 | - 1)$ which approaches $\ln(2)$ for large $m$. 49 | 50 | ## Deadline Driven Scheduling Algorithm 51 | The **deadline driven scheduling algorithm** (or earliest deadline first 52 | scheduling algorithm) dynamically assigns the highest priority to the task with 53 | the most imminent deadline. This scheduling algorithm has a least upper bound 54 | of 100% processor utilization. Moreover, if any feasible schedule exists for a 55 | set of tasks, a feasible deadline driven schedule exists. 56 | 57 | ## Mixed Scheduling Algorithm. 58 | Scheduling hardware (at the time) resembled a fixed priority scheduler, but a 59 | dynamic scheduler could be implemented for less frequent tasks. A hybrid 60 | scheduling algorithm scheduled the $k$ most frequent tasks using the 61 | rate-monotonic scheduling algorithm and scheduled the rest using the deadline 62 | driven algorithm. 63 | 64 | 67 | -------------------------------------------------------------------------------- /papers/mckeen2013innovative.md: -------------------------------------------------------------------------------- 1 | ## [Innovative Instructions and Software Model for Isolated Execution (2013)](https://scholar.google.com/scholar?cluster=11948934428694485446&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Applications are responsible for managing an increasing amount of sensitive 4 | information. Intel SGX is a set of new instructions and memory access changes 5 | that allow users to put code and data into secured *enclaves* that are 6 | inaccessible even to privileged code. The enclaves provide confidentiality, 7 | integrity, and isolation. 8 | 9 | A process' virtual memory space is divided into different sections. There's a 10 | section for the code, a section for the stack, a section for the heap, etc. An 11 | enclave is just another region in the user's address space, except the enclave 12 | has some special properties. The enclave can store code and data. When a 13 | process is running code in the enclave, it can access data in the enclave. 14 | Otherwise, the enclave data is off limits. 15 | 16 | Each enclave is composed of a set of pages, and these pages are stored in the 17 | *Enclave Page Cache (EPC)*. 18 | 19 | +---------------------------------+ 20 | | .-----. .-----. .-----. .-----. | 21 | | | \| | \| | \| | \| | 22 | | | | | | | | | | | EPC 23 | | | | | | | | | | | 24 | | '.....' '.....' '.....' '.....' | 25 | +---------------------------------+ 26 | 27 | In addition to storing enclave pages, the EPC also stores SGX structures (I 28 | guess Enclave Page and SGX Structures Cache (EPSSC) was too long of an 29 | acronym). The EPC is protected from hardware and software access. A related 30 | structure, the *Enclave Page Cache Map* (EPCM), stores a piece of metadata for 31 | each active page in the EPC. Moreover, each enclave is assigned a *SGX Enclave 32 | Control Store* (SECS). There are instructions to create, add pages to, secure, 33 | enter, and exit an enclave. 34 | 35 | Computers have a finite, nay scarce amount of memory. In order to allow as many 36 | processes to operate with this scarce resource, operating systems implement 37 | paging. 
Active pages of memory are stored in memory, while inactive pages are 38 | flushed to the disk. Analogously, in order to allow as many processes to use 39 | enclaves as possible, SGX allows for pages in the EPC to be paged to main 40 | memory. The difficulty is that the operating system is not trusted and neither 41 | is main memory. 42 | 43 | +---------------------------------+ 44 | | .-----. .-----. .-----. | 45 | | | \| | \| | \| | 46 | | | VA | | ^ | | | | | EPC (small and trusted) 47 | | | | | | | | | | | 48 | | '.....' '..|..' | '.....' | 49 | +------------|-------|------------+ 50 | | | (paging) 51 | +------------|-------|----------------------- 52 | | .-----. | .--|--. .-----. .-----. 53 | | | \| | | | \| | \| | \| 54 | | | | | | v | | | | | ... main memory (big but not trusted) 55 | | | | | | | | | | 56 | | '.....' '.....' '.....' '.....' 57 | +-------------------------------------------- 58 | 59 | In order to page an EPC page to main memory, all cached translations that point 60 | to it must first be cleared. Then, the page is encrypted. A nonce, called a 61 | version, is created for the page and put into a special *Version Array* (VA) 62 | page in the EPC. A MAC is taken of the encrypted contents, the version, and the 63 | page's metadata and is stored with the file in main memory. When the page is 64 | paged back into the EPC, the MAC is checked against the version in the VA 65 | before the VA is cleared. 66 | -------------------------------------------------------------------------------- /papers/mckusick1984fast.md: -------------------------------------------------------------------------------- 1 | # [A Fast File System for UNIX (1984)](https://scholar.google.com/scholar?cluster=1900924654174602790) 2 | The **Fast Filesystem** (FFS) improved the read and write throughput of the 3 | original Unix file system by 10x by 4 | 5 | 1. increasing the block size, 6 | 2. dividing blocks into fragments, and 7 | 3. performing smarter allocation. 8 | 9 | The original Unix file system, dubbed "the old file system", divided disk 10 | drives into partitions and loaded a file system on to each partition. The 11 | filesystem included a superblock containing metadata, a linked list of free 12 | data blocks known as the **free list**, and an **inode** for every file. 13 | Notably, the file system was composed of **512 byte** blocks; no more than 512 14 | bytes could be transfered from the disk at once. Moreover, the file system had 15 | poor data locality. Files were often sprayed across the disk requiring lots of 16 | random disk accesses. 17 | 18 | The "new file system" improved performance by increasing the block size to any 19 | power of two at least as big as **4096 bytes**. In order to handle small files 20 | efficiently and avoid high internal fragmentation and wasted space, blocks were 21 | further divided into **fragments** at least as large as the disk sector size. 22 | 23 | ``` 24 | +------------+------------+------------+------------+ 25 | block | fragment 1 | fragment 2 | fragment 3 | fragment 4 | 26 | +------------+------------+------------+------------+ 27 | ``` 28 | 29 | Files would occupy as many complete blocks as possible before populating at 30 | most one fragmented block. 31 | 32 | Data was also divided into **cylinder groups** where each cylinder group included 33 | a copy of the superblock, a list of inodes, a bitmap of available blocks (as 34 | opposed to a free list), some usage statistics, and finally data blocks. 
The 35 | file system took advantage of hardware specific information to place data at 36 | rotational offsets specific to the hardware so that files could be read with as 37 | little delay as possible. Care was also taken to allocate files contiguously, 38 | similar files in the same cylinder group, and all the inodes in a directory 39 | together. Moreover, if the amount of available space gets too low, then it 40 | becomes more and more difficult to allocate blocks efficiently. For example, it 41 | becomes hard to allocate the blocks of a file contiguously. Thus, the system 42 | always tries to keep ~10% of the disk free. 43 | 44 | Allocation is also improved in the FFS. A top level global policy uses file 45 | system wide information to decide where to put new files. Then, a local policy 46 | places the blocks. Care must be taken to colocate blocks that are accessed 47 | together, but crowding a single cylinder group can exhaust its resources. 48 | 49 | In addition to performance improvements, FFS also introduced 50 | 51 | 1. longer filenames, 52 | 2. advisory file locks, 53 | 3. soft links, 54 | 4. atomic file renaming, and 55 | 5. disk quota enforcement. 56 | -------------------------------------------------------------------------------- /papers/moore2006inferring.md: -------------------------------------------------------------------------------- 1 | ## [Inferring Internet Denial-of-Service Activity (2001)](TODO) ## 2 | **Summary.** 3 | This paper uses *backscatter analysis* to quantitatively analyze 4 | denial-of-service attacks on the Internet. Most flooding denial-of-service 5 | attacks involve IP spoofing, where each packet in an attack is given a faux IP 6 | address drawn uniformly at random from the space of all IP addresses. If each 7 | packet elicits a reply packet from the victim, then victims of 8 | denial-of-service attacks end up sending unsolicited messages to servers 9 | uniformly at random. By monitoring this *backscatter* at enough hosts, one can 10 | infer the number, intensity, and type of denial-of-service attacks. 11 | 12 | There are of course a number of assumptions upon which backscatter depends. 13 | 14 | 1. *Address uniformity*. It is assumed that DOS attackers spoof IP addresses 15 | uniformly at random. 16 | 2. *Reliable delivery*. It is assumed that packets, as part of the attack and 17 | response, are delivered reliably. 18 | 3. *Backscatter hypothesis*. It is assumed that unsolicited packets arriving at 19 | a host are actually part of backscatter. 20 | 21 | The paper performs a backscatter analysis on 1/256 of the IPv4 address space. 22 | They cluster the backscatter data using a *flow-based classification* to 23 | measure individual attacks and using an *event-based classification* to measure 24 | the intensity of attacks. The findings of the analysis are best summarized by 25 | the paper. 26 | -------------------------------------------------------------------------------- /papers/prabhakaran2005analysis.md: -------------------------------------------------------------------------------- 1 | ## [Analysis and Evolution of Journaling File Systems (2005)](TODO) ## 2 | **Summary.** 3 | The authors develop and apply two file system analysis techniques dubbed 4 | *Semantic Block-Level Analysis* (SBA) and *Semantic Trace Playback* (STP) to 5 | four journaled file systems: ext3, ReiserFS, JFS, and NTFS. 6 | 7 | - Benchmarking a file system can tell you for *which* workloads it is fast and 8 | for which it is slow.
But, these benchmarks don't tell you *why* the file 9 | system performs the way it does. By leveraging semantic information about 10 | block traces, SBA aims to identify the cause of file system behavior. 11 | 12 | Users install an SBA driver into the OS and mount the file system of interest 13 | on to the SBA driver. The interposed driver intercepts and logs all 14 | block-level requests and responses to and from the disk. Moreover, the SBA 15 | driver is specialized to each file system under consideration so that it can 16 | interpret each block operation, categorizing it as a read/write to a journal 17 | block or regular data block. Implementing an SBA driver is easy to do, 18 | guarantees that no operation goes unlogged, and has low overhead. A toy sketch of this classification step appears at the end of this summary. 19 | - Deciding the effectiveness of new file system policies is onerous. For 20 | example, to evaluate a new journaling scheme, you would traditionally have to 21 | implement the new scheme and evaluate it on a set of benchmarks. If it 22 | performs well, you keep the changes; otherwise, you throw them away. STP uses 23 | block traces to perform a light-weight simulation to analyze new file system 24 | policies without implementation overhead. 25 | 26 | STP is a user-level process that reads in block traces produced by SBA and 27 | file system operation logs and issues direct I/O requests to the disk. It can 28 | then be used to evaluate small, simple modifications to existing file 29 | systems. For example, it can be used to evaluate the effects of moving the 30 | journal from the beginning of the file system to the middle of the file 31 | system. 32 | 33 | The authors spend the majority of the paper examining *ext3*: the third 34 | extended file system. ext3 introduces journaling to ext2, and ext2 resembles 35 | the [Unix FFS](#a-fast-file-system-for-unix-1984) with partitions divided into 36 | groups each of which contains bitmaps, inodes, and regular data. ext3 comes 37 | with three journaling modes: 38 | 39 | 1. Using *writeback* journaling, metadata is journaled and data is 40 | asynchronously written to disk. This has the weakest consistency guarantees. 41 | 2. Using *ordered* journaling, data is written to disk before its associated 42 | metadata is journaled. 43 | 3. Using *data* journaling, both data and metadata are journaled before being 44 | *checkpointed*: copied from the journal to the disk. 45 | 46 | Moreover, operations are grouped into *compound transactions* and issued in 47 | batch. ext3 SBA analysis led to the following conclusions: 48 | 49 | - The fastest journaling mode depends heavily on the workload. 50 | - Compound transactions can lead to *tangled synchrony* in which asynchronous 51 | operations are made synchronous when placed in a transaction with synchronous 52 | operations. 53 | - In ordered journaling, ext3 doesn't concurrently write to the journal and the 54 | disk. 55 | 56 | STP was also used to analyze the effects of 57 | 58 | - Journal position in the disk. 59 | - An adaptive journaling mode that dynamically chooses between ordered and data 60 | journaling. 61 | - Untangling compound transactions. 62 | - Data journaling in which diffs are journaled instead of whole blocks. 63 | 64 | SBA and STP were also applied to ReiserFS, JFS, and NTFS.
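To make the SBA idea concrete, here is a minimal Python sketch of the block classification step. It is not the authors' driver: the trace format and the journal block range are invented for illustration, and a real SBA driver understands many more block types (superblock, inode, bitmap, and directory blocks) for each file system.

```
# Toy SBA-style classifier (not the authors' driver). Assumes a made-up trace
# of (operation, block number) pairs and a known, hypothetical journal region.
JOURNAL_BLOCKS = range(1_000, 5_000)

def classify(trace):
    """Count journal writes, in-place (fixed-location) writes, and reads."""
    counts = {"journal_write": 0, "fixed_write": 0, "read": 0}
    for op, block in trace:
        if op == "read":
            counts["read"] += 1
        elif block in JOURNAL_BLOCKS:
            counts["journal_write"] += 1
        else:
            counts["fixed_write"] += 1
    return counts

# Under data journaling, most blocks are written twice: once to the journal
# and again when checkpointed to their home location.
trace = [("write", 1200), ("write", 1201), ("write", 8000),
         ("write", 1202), ("write", 8001), ("read", 8000)]
print(classify(trace))  # {'journal_write': 3, 'fixed_write': 2, 'read': 1}
```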
65 | -------------------------------------------------------------------------------- /papers/quigley2009ros.md: -------------------------------------------------------------------------------- 1 | # [ROS: an open-source Robot Operating System (2009)](https://scholar.google.com/scholar?cluster=143767492575573826) 2 | Writing code that runs on a robot is a *very* challenging task. Hardware varies 3 | from robot to robot, and the software required to perform certain tasks (e.g. 4 | picking up objects) can require an enormous amount of code (e.g. low-level 5 | drivers, object detection, motion planning, etc.). ROS, the Robot Operating 6 | System, is a framework for writing and managing distributed systems that run on 7 | robots. Note that ROS is not an operating system as the name suggests. 8 | 9 | ## Nomenclature 10 | A **node** is a process (or software module) that performs computation. Nodes 11 | communicate by sending **messages** (like protocol buffers) to one another. 12 | Nodes can publish messages to a **topic** or subscribe to a topic to receive 13 | messages. ROS also provides **services** (i.e. RPCs) which are defined by a 14 | service name, a request message type, and a response message type. A minimal publisher/subscriber sketch appears after the design goals below. 15 | 16 | ## What is ROS? 17 | In short, ROS provides the following core functionality. 18 | 19 | - ROS provides a messaging format similar to protocol buffers. Programmers 20 | define messages using a ROS IDL and a compiler generates code in various 21 | languages (e.g. C++, Octave, LISP). Processes running on robots then send 22 | messages to one another using XML RPC. 23 | - ROS provides command line tools to debug or alter the execution of 24 | distributed systems. For example, one command line tool can be used to log a 25 | message stream to disk without having to change any source code. These logged 26 | messages can then be replayed to develop and test other modules. Other 27 | command line tools are described below. 28 | - ROS organizes code into **packages**. A package is just a directory of code 29 | (or data or anything really) with some associated metadata. Packages are 30 | organized into **repositories** which are trees of directories with packages 31 | at the leaves. ROS provides a package manager to query repositories, download 32 | packages, and build code. 33 | 34 | ## Design Goals 35 | The following design goals motivate ROS' design. 36 | 37 | - **Peer-to-peer.** Because ROS systems are running on interconnected robots, 38 | it makes sense for systems to be written in a peer-to-peer fashion. For 39 | example, imagine two clusters of networked robots that communicate over a slow 40 | wireless link. Running a master on either of the clusters will slow down the 41 | system. 42 | - **Multilingual.** ROS wants to support multiple programming languages which 43 | motivated its protobuf-like message format. 44 | - **Tools-based.** Instead of being a monolithic code base, ROS includes a 45 | disaggregated set of command line tools. 46 | - **Thin.** Many robot algorithms developed by researchers *could* be re-used 47 | but often aren't because they become tightly coupled with the researcher's 48 | specific environment. ROS encourages algorithms to be developed agnostic to ROS 49 | so they can be easily re-used. 50 | - **Free and Open-Source.** By being free and open-source, ROS is easier to 51 | debug and encourages collaboration between lots of researchers.
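To ground the node/topic/message vocabulary above, here is a hedged sketch written against rospy, the ROS 1 Python client library. The exact signatures follow a later rospy release rather than the 2009 API, and the `chatter` topic and one-message-per-second workload are invented for illustration.

```
#!/usr/bin/env python
# Two logical nodes: a talker that publishes to the "chatter" topic and a
# listener that subscribes to it. Run each function in its own process.
import rospy
from std_msgs.msg import String

def talker():
    rospy.init_node("talker")
    pub = rospy.Publisher("chatter", String, queue_size=10)
    rate = rospy.Rate(1)  # publish once per second
    while not rospy.is_shutdown():
        pub.publish(String(data="hello"))
        rate.sleep()

def listener():
    rospy.init_node("listener")
    rospy.Subscriber("chatter", String, lambda msg: rospy.loginfo(msg.data))
    rospy.spin()  # hand control to rospy's callback loop

if __name__ == "__main__":
    talker()  # or listener(), depending on which node this process should be
```

Because the two nodes only agree on a topic name and a message type, the listener can be logged, replayed, or swapped out without touching the talker, which is exactly what the tooling use cases below rely on.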
52 | 53 | ## Use Cases 54 | - **Debugging a single node.** Because ROS systems are loosely coupled modules 55 | communicating via RPC, one module can be debugged against other already 56 | debugged modules. 57 | - **Logging and playback.** As mentioned above, message streams can be 58 | transparently logged to disk for future replay. 59 | - **Packaged subsystems.** Programmers can describe the structure of a 60 | distributed system, and ROS can launch the system across multiple hosts. 61 | - **Collaborative development.** ROS' packages and repositories encourage 62 | collaboration. 63 | - **Visualization and monitoring.** Message streams can be intercepted and 64 | visualized over time. Subscribed streams can also be filtered by expressions 65 | before being visualized. 66 | - **Composition of functionality.** Namespaces can be used to launch the same 67 | system multiple times. 68 | -------------------------------------------------------------------------------- /papers/ritchie1978unix.md: -------------------------------------------------------------------------------- 1 | # [The Unix Time-Sharing System (1974)](https://scholar.google.com/scholar?cluster=2132419950152599605&hl=en&as_sdt=0,5) 2 | Unix was an operating system developed by Dennis Ritchie, Ken Thompson, and 3 | others at Bell Labs. It was the successor to Multics and is probably the single 4 | most influential piece of software ever written. 5 | 6 | Earlier versions of Unix were written in assembly, but the project was later 7 | ported to C: probably the single most influential programming language ever 8 | developed. This resulted in a 1/3 increase in size, but the code was much more 9 | readable and the system included new features, so it was deemed worth it. 10 | 11 | The most important feature of Unix was its file system. Ordinary files were 12 | simple arrays of bytes physically stored as 512-byte blocks: a rather simple 13 | design. Each file was given an inumber: an index into an ilist of inodes. Each 14 | inode contained metadata about the file and pointers to the actual data of the 15 | file in the form of direct and indirect blocks. This representation made it 16 | easy to support (hard) linking. Each file was protected with 9 bits: the same 17 | protection model Linux uses today. Directories were themselves files which 18 | stored mappings from filenames to inumbers. Devices were modeled simply as 19 | files in the `/dev` directory. This unifying abstraction allowed devices to be 20 | accessed with the same API. File systems could be mounted using the `mount` 21 | command. Notably, Unix didn't support user level locking, as it was neither 22 | necessary nor sufficient. 23 | 24 | Processes in Unix could be created using a fork followed by an exec, and 25 | processes could communicate with one another using pipes. The shell was nothing 26 | more than an ordinary process. Unix included file redirection, pipes, and the 27 | ability to run programs in the background. All this was implemented using fork, 28 | exec, wait, and pipes, as sketched below. 29 | 30 | Unix also supported signals.
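As a concrete illustration of that fork/exec/pipe pattern, here is a minimal sketch using Python's `os` module as a stand-in for the C system-call interface; it wires up the equivalent of the shell pipeline `ls | wc -l`. The example is ours, not from the paper, and it runs only on Unix-like systems.

```
# Build `ls | wc -l` by hand, the way a shell does: pipe, fork, exec, wait.
import os

r, w = os.pipe()                      # pipe(): read end and write end

if os.fork() == 0:                    # child 1: producer
    os.dup2(w, 1)                     # redirect stdout into the pipe
    os.close(r); os.close(w)
    os.execvp("ls", ["ls"])           # exec replaces the child's image

if os.fork() == 0:                    # child 2: consumer
    os.dup2(r, 0)                     # read stdin from the pipe
    os.close(r); os.close(w)
    os.execvp("wc", ["wc", "-l"])

os.close(r); os.close(w)              # parent keeps no pipe ends open
os.wait(); os.wait()                  # wait for both children, like the shell
```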
31 | -------------------------------------------------------------------------------- /papers/saltzer1984end.md: -------------------------------------------------------------------------------- 1 | # [End-to-End Arguments in System Design (1984)](https://scholar.google.com/scholar?cluster=9463646641349983499) 2 | This paper presents the **end-to-end argument**: 3 | 4 | > The function in question can completely and correctly be implemented only 5 | > with the knowledge and help of the application standing at the end points of 6 | > the communication system. Therefore, providing that questioned function as a 7 | > feature of the communication system itself is not possible. (Sometimes an 8 | > incomplete version of the function provided by the communication system may 9 | > be useful as a performance enhancement.) 10 | 11 | which says that in a layered system, functionality should, nay must be 12 | implemented as close to the application as possible to ensure correctness (and 13 | usually also performance). 14 | 15 | The end-to-end argument is motivated by an example file transfer scenario in 16 | which host A transfers a file to host B. Every step of the file transfer 17 | presents an opportunity for failure. For example, the disk may silently corrupt 18 | data or the network may reorder or drop packets. Any attempt by one of these 19 | subsystems to ensure reliable delivery is wasted effort since the delivery may 20 | still fail in another subsystem. The only way to guarantee correctness is to 21 | have the file transfer application check for correct delivery itself. For 22 | example, once it receives the entire file, it can send the file's checksum back 23 | to host A to confirm correct delivery. 24 | 25 | In addition to being necessary for correctness, applying the end-to-end 26 | argument also usually leads to improved performance. When a functionality is 27 | implemented in a lower level subsystem, every application built on it must pay 28 | the cost, even if it does not require the functionality. 29 | 30 | There are numerous other examples of the end-to-end argument: 31 | 32 | - Guaranteed packet delivery. 33 | - Secure data transmission. 34 | - Duplicate message suppression. 35 | - FIFO delivery. 36 | - Transaction management. 37 | - RISC. 38 | 39 | The end-to-end argument is not a hard and fast rule. In particular, it may be 40 | eschewed when implementing a functionality in a lower level can lead to 41 | performance improvements. Consider again the file transfer protocol above and 42 | assume the network drops one in every 100 packets. As the file becomes longer, 43 | the odds of a successful delivery become increasingly small making it 44 | prohibitively expensive for the application alone to ensure reliable delivery. 45 | The network may be able to perform a small amount of work to help guarantee 46 | reliable delivery making the file transfer more efficient. 47 | -------------------------------------------------------------------------------- /papers/stonebraker1987design.md: -------------------------------------------------------------------------------- 1 | # [The Design of the POSTGRES Storage System (1987)](https://scholar.google.com/scholar?cluster=6675294870941893293) 2 | POSTGRES, the ancestor of PostgreSQL, employed a storage system with three 3 | interesting characteristics: 4 | 5 | 1. No write-ahead logging (WAL) was used. In fact, there was no recovery code 6 | at all. 7 | 2. The entire database history was recorded and archived. 
Updates were 8 | converted to insertions of new record versions, and data could be queried arbitrarily far in the past. 9 | 3. The system was designed as a collection of asynchronous processes, rather 10 | than a monolithic piece of code. 11 | 12 | Transactions were sequentially assigned 40-bit transaction identifiers (XID) 13 | starting from 0. Each operation in a transaction was sequentially assigned a 14 | command identifier (CID). Together the XID and CID formed a 48-bit interaction 15 | identifier (IID). Each IID was also assigned a two-bit transaction status and 16 | all IIDs were stored in a transaction log with a most recent **tail** of 17 | uncommitted transactions and a **body** of completed transactions. 18 | 19 | Every tuple in a relation was annotated with 20 | 21 | - a record id, 22 | - a min XID, CID, and timestamp, 23 | - a max XID, CID and timestamp, and 24 | - a forward pointer. 25 | 26 | The min values were associated with the transaction that created the record, 27 | and the max values were associated with the transaction that updated the 28 | record. When a record was updated, a new tuple was allocated with the same 29 | record id but updated min values, max values, and forward pointers. The new 30 | tuples were stored as diffs; the original tuple was the **anchor point**; and 31 | the forward pointers chained together the anchor point with its diffs. A toy sketch of this no-overwrite versioning appears at the end of this summary. 32 | 33 | Data could be queried at a particular timestamp or in a range of timestamps. 34 | Moreover, the min and max values of the records could be extracted allowing for 35 | queries like this: 36 | 37 | SELECT Employee.min_timestamp, Employee.max_timestamp, Employee.id 38 | FROM Employee[1 day ago, now] 39 | WHERE Employee.Salary > 10,000 40 | 41 | The timestamp of a transaction was not assigned when the transaction began. 42 | Instead, the timestamps were maintained in a TIME relation, and the timestamps 43 | in the records were left empty and asynchronously filled in. Upon creation, 44 | relations could be annotated as 45 | 46 | - **no archive** in which case timestamps were never filled in, 47 | - **light archive** in which timestamps were read from a TIME relation, or 48 | - **heavy archive** in which timestamps were lazily copied from the TIME 49 | relation into the records. 50 | 51 | POSTGRES allowed for any number of indexes. The type of index (e.g. B-tree) and 52 | the operations that the index efficiently supported were explicitly set by the 53 | user. 54 | 55 | A **vacuum cleaner** process would, by instruction of the user, vacuum records 56 | stored on disk to an archival storage (e.g. WORM device). The archived data was 57 | allowed to have a different set of indexes. The vacuum cleaning proceeded in 58 | three steps: 59 | 60 | 1. Data was archived and archive indexes were formed. 61 | 2. Anchor points were updated in the database. 62 | 3. Archived data space was reclaimed. 63 | 64 | The system could crash during this process which could lead to duplicate 65 | entries, but nothing more nefarious. The consistency guarantees were a bit weak 66 | compared to today's standards. Some crashes could lead to slowly accumulating 67 | un-reclaimed space. 68 | 69 | Archived data could be indexed by values and by time ranges efficiently using 70 | R-trees. Multi-media indexes which spanned the disk and archive were also 71 | supported.
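The no-overwrite versioning described above can be sketched in a few lines of Python. This is a toy model, not POSTGRES: real records carry XIDs, CIDs, and lazily filled timestamps, and new versions are stored as diffs off an anchor point rather than as full copies. Here, hypothetical `t_min`/`t_max` fields bound each version's validity interval.

```
# Toy no-overwrite store: an update closes the old version's interval and
# appends a new version, so reads can be evaluated "as of" any past time.
import time

class VersionedRelation:
    def __init__(self):
        self.versions = {}   # record id -> list of (t_min, t_max, value)

    def put(self, rid, value, now=None):
        now = now or time.time()
        chain = self.versions.setdefault(rid, [])
        if chain:
            t_min, _, old = chain[-1]
            chain[-1] = (t_min, now, old)      # close the current version
        chain.append((now, None, value))       # append the new version

    def get(self, rid, as_of=None):
        as_of = as_of or time.time()
        for t_min, t_max, value in self.versions.get(rid, []):
            if t_min <= as_of and (t_max is None or as_of < t_max):
                return value
        return None

rel = VersionedRelation()
rel.put("emp1", {"salary": 9000}, now=1.0)
rel.put("emp1", {"salary": 12000}, now=2.0)
print(rel.get("emp1", as_of=1.5))   # {'salary': 9000} -- time travel
print(rel.get("emp1"))              # {'salary': 12000} -- current version
```

In this model, a vacuum cleaner would simply move closed versions (those with a non-empty `t_max`) out of the live store and into archival storage.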
72 | -------------------------------------------------------------------------------- /papers/vavilapalli2013apache.md: -------------------------------------------------------------------------------- 1 | ## [Apache Hadoop YARN: Yet Another Resource Negotiator (2013)](https://scholar.google.com/scholar?cluster=3355598125951377731&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Hadoop began as a MapReduce clone designed for large scale web crawling. As big 4 | data became trendy and data became... big, Hadoop became the de facto 5 | standard data processing system, and large Hadoop clusters were installed in 6 | many companies as "the" cluster. As application requirements evolved, users 7 | started abusing the large Hadoop clusters in unintended ways. For example, users would 8 | submit map-only jobs which were thinly disguised web servers. Apache Hadoop YARN 9 | is a cluster manager that aims to disentangle cluster management from the 10 | programming paradigm and has the following goals: 11 | 12 | - Scalability 13 | - Multi-tenancy 14 | - Serviceability 15 | - Locality awareness 16 | - High cluster utilization 17 | - Reliability/availability 18 | - Secure and auditable operation 19 | - Support for programming model diversity 20 | - Flexible resource model 21 | - Backward compatibility 22 | 23 | YARN is orchestrated by a per-cluster *Resource Manager* (RM) that tracks 24 | resource usage and node liveness, enforces allocation invariants, and 25 | arbitrates contention among tenants. *Application Masters* (AM) are responsible 26 | for negotiating resources with the RM and managing the execution of a single job. 27 | AMs send ResourceRequests to the RM telling it resource requirements, locality 28 | preferences, etc. In return, the RM hands out *containers* (e.g. <2GB RAM, 1 29 | CPU>) to AMs. The RM also communicates with Node Managers (NM) running on each 30 | node which are responsible for measuring node resources and managing (i.e. 31 | starting and killing) tasks. When a user wants to submit a job, they send it to 32 | the RM, which hands a capability to an AM to present to an NM. The RM is a 33 | single point of failure. If it fails, it restores its state from disk and kills 34 | all running AMs. The AMs are trusted to be fault-tolerant and resubmit any 35 | prematurely terminated jobs. 36 | 37 | YARN is deployed at Yahoo where it manages roughly 500,000 daily jobs. YARN 38 | supports frameworks like Hadoop, Tez, Spark, Dryad, Giraph, and Storm. 39 | 40 | -------------------------------------------------------------------------------- /papers/verma2015large.md: -------------------------------------------------------------------------------- 1 | ## [Large-scale cluster management at Google with Borg (2015)](https://scholar.google.com/scholar?cluster=18268680833362692042&hl=en&as_sdt=0,5) 2 | **Summary.** 3 | Borg is Google's cluster manager. Users submit *jobs*, collections of *tasks*, 4 | to Borg; each job runs in a single *cell*, and many cells live inside a 5 | single *cluster*. Borg jobs are either high priority latency-sensitive 6 | *production* jobs (e.g. user facing products and core infrastructure) or low 7 | priority *non-production* batch jobs. Jobs have typical properties like name 8 | and owner and can also express constraints (e.g. only run on certain 9 | architectures). Tasks also have properties and state their resource demands. 10 | Borg jobs are specified in BCL and are bundled as statically linked 11 | executables. Jobs are labeled with a priority and must operate within quota 12 | limits.
Resources are bundled into *allocs* in which multiple tasks can run. 13 | Borg also manages a naming service, and exports a UI called Sigma to 14 | developers. 15 | 16 | Cells are managed by five-way replicated *Borgmasters*. A Borgmaster 17 | communicates with *Borglets* running on each machine via RPC, manages the Paxos 18 | replicated state of system, and exports information to Sigma. There is also a 19 | high fidelity borgmaster simulator known as the Fauxmaster which can used for 20 | debugging. 21 | 22 | One subcomponent of the Borgmaster handles scheduling. Submitted jobs are 23 | placed in a queue and scheduled by priority and round-robin within a priority. 24 | Each job undergoes feasibility checking where Borg checks that there are enough 25 | resources to run the job and then scoring where Borg determines the best place 26 | to run the job. Worst fit scheduling spreads jobs across many machines allowing 27 | for spikes in resource usage. Best fit crams jobs as closely as possible which 28 | is bad for bursty loads. Borg uses a scheduler which attempts to limit 29 | "stranded resources": resources on a machine which cannot be used because other 30 | resources on the same machine are depleted. Tasks that are preempted are placed 31 | back on the queue. Borg also tries to place jobs where their packages are 32 | already loaded, but offers no other form of locality. 33 | 34 | Borglets run on each machine and are responsible for starting and stopping 35 | tasks, managing logs, and reporting to the Borgmaster. The Borgmaster 36 | periodically polls the Borglets (as opposed to Borglets pushing to the 37 | Borgmaster) to avoid any need for flow control or recovery storms. 38 | 39 | The Borgmaster performs a couple of tricks to achieve high scalability. 40 | 41 | - The scheduler operates on slightly stale state, a form of "optimistic 42 | scheduling". 43 | - The Borgmaster caches job scores. 44 | - The Borgmaster performs feasibility checking and scoring for all equivalent 45 | jobs at once. 46 | - Complete scoring is hard, so the Borgmaster uses randomization. 47 | 48 | The Borgmaster puts the onus of fault tolerance on applications, expecting them 49 | to handle occasional failures. Still, the Borgmaster also performs a set of 50 | nice tricks for availability. 51 | 52 | - It reschedules evicted tasks. 53 | - It spreads tasks across failure domains. 54 | - It limits the number of tasks in a job that can be taken down due to 55 | maintenance. 56 | - Avoids past machine/task pairings that lead to failure. 57 | 58 | To measure cluster utilization, Google uses a *cell compaction* metric: the 59 | smallest a cell can be to run a given workload. Better utilization leads 60 | directly to savings in money, so Borg is very focused on improving utilization. 61 | For example, it allows non-production jobs to reclaim unused resources from 62 | production jobs. 63 | 64 | Borg uses containers for isolation. It also makes sure to throttle or kill jobs 65 | appropriately to ensure performance isolation. 66 | 67 | -------------------------------------------------------------------------------- /papers/vogels2009eventually.md: -------------------------------------------------------------------------------- 1 | # [Eventually Consistent (2009)](https://scholar.google.com/scholar?cluster=4308857796184904369) 2 | In this CACM article, Werner Vogels discusses eventual consistency as well as a 3 | couple other forms of consistency. 
4 | 5 | ## Historical Perspective 6 | Back in the day, when people were thinking about how to build distributed 7 | systems, they thought that strong consistency was the only option. Anything 8 | else would just be incorrect, right? Well, fast forward to the 90's and 9 | availability started being favored over consistency. In 2000, Brewer released 10 | the CAP theorem unto the world, and weak consistency really took off. 11 | 12 | ## Client's Perspective of Consistency 13 | There is a zoo of consistency from the perspective of a client. 14 | 15 | - **Strong consistency.** A data store is strongly consistent if after a write 16 | completes, it is visible to all subsequent reads. 17 | - **Weak consistency.** Weak consistency is a synonym for no consistency. Your 18 | reads can return garbage. 19 | - **Eventual consistency.** "the storage system guarantees that if no new 20 | updates are made to the object, eventually all accesses will return the last 21 | updated value." 22 | - **Causal consistency.** If A issues a write and then contacts B, B's read 23 | will see the effect of A's write since it causally comes after it. Causally 24 | unrelated reads can return whatever they want. 25 | - **Read-your-writes consistency.** Clients read their most recent write or a 26 | more recent version. 27 | - **Session consistency.** So long as a client maintains a session, it gets 28 | read-your-writes consistency. 29 | - **Monotonic read consistency.** Over time, a client will see increasingly 30 | more fresh versions of data. 31 | - **Monotonic write consistency.** A client's writes will be executed in the 32 | order in which they are issued. 33 | 34 | ## Server's Perspective of Consistency 35 | Systems can implement consistency using quorums in which a write is sent to W 36 | of the N replicas, and a read is sent to R of the N replicas. If R + W > N, 37 | then we have strong consistency. 38 | -------------------------------------------------------------------------------- /papers/welsh2001seda.md: -------------------------------------------------------------------------------- 1 | ## [SEDA: An Architecture for Well-Conditioned, Scalable Internet Services (2001)](https://goo.gl/wrn04s) ## 2 | **Summary.** 3 | Writing distributed Internet applications is difficult because they have to 4 | serve a huge number of requests, and the number of requests is highly variable 5 | and bursty. Moreover, the applications themselves are also complicated pieces 6 | of code. This paper introduces the *staged event-driven architecture* (SEDA) 7 | which has the following goals. 8 | 9 | - *Massive concurrency.* Applications written using SEDA should be able to 10 | support a very large number of clients. 11 | - *Simplify the construction of well-conditioned services.* A 12 | *well-conditioned* service is one that gracefully degrades. Throughput 13 | increases with the number of clients until it saturates at some threshold. At 14 | this point, throughput remains constant and latency increases proportionally 15 | with the number of clients. SEDA is designed to make writing well-conditioned 16 | services easy. 17 | - *Enable introspection.* SEDA applications should be able to inspect and adapt 18 | to incoming request queues. Some request-per-thread architectures, for 19 | example, do not enable introspection. Control over thread scheduling is left 20 | completely to the OS; it cannot be adapted to the queue of incoming 21 | requests. 
22 | - *Self-tuning resource management.* SEDA programmers should not have to tune 23 | knobs themselves. 24 | 25 | 26 | SEDA accomplishes these goals by structuring applications as a *network* of 27 | *event-driven stages* connected by *explicit message queues* and managed by 28 | *dynamic resource controllers*. That's a dense sentence, so let's elaborate. 29 | 30 | Threading based concurrency models have scalability limitations due to the 31 | overheads of context switching, poor caching, synchronization, etc. The 32 | [event-driven concurrency 33 | model](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-events.pdf) involves 34 | nothing more than a single-threaded loop that reads messages and processes 35 | them. It avoids many of the scalability limitations that threading models face. 36 | In SEDA, the atomic unit of execution is a *stage* and is implemented using an 37 | event-driven concurrency model. Each stage has an input queue of messages which 38 | are read in batches by a thread pool and processed by a user-provided event 39 | handler which can in turn write messages to other stages. 40 | 41 | A SEDA application is simply a network (i.e. graph) of interconnected stages. 42 | Notably, the input queue of every stage is finite. This means that when one 43 | stage tries to write data to the input queue of another stage, it may fail. 44 | When this happens, stages have to block (i.e. pushback), start dropping 45 | requests (i.e. load shedding), degrade service, deliver an error to the user, 46 | etc. 47 | 48 | To ensure SEDA applications are well-conditioned, various resource managers 49 | tune SEDA application parameters to ensure consistent performance. For example, 50 | the *thread pool controller* scales the number of threads within stages based 51 | on the number of messages in its input queue. Similarly, the *batching 52 | controller* adjusts the size of the batch delivered to each event handler. 53 | 54 | The authors developed a SEDA prototype in Java dubbed Sandstorm. As with all 55 | event-driven concurrency models, Sandstorm depends on asynchronous I/O 56 | libraries. It implements asynchronous network I/O as three stages using 57 | existing OS functionality (i.e. select/poll). It implements asynchronous file 58 | I/O using a dynamically resizable thread pool that issues synchronous calls; OS 59 | support for asynchronous file I/O was weak at the time. The authors evaluate 60 | Sandstorm by implementing and evaluating an HTTP server and Gnutella router. 61 | -------------------------------------------------------------------------------- /papers/zaharia2012resilient.md: -------------------------------------------------------------------------------- 1 | # [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012)](TODO) ## 2 | Frameworks like MapReduce made processing large amounts of data easier, but 3 | they did not leverage distributed memory. If a MapReduce was run iteratively, 4 | it would write all of its intermediate state to disk: something that was 5 | prohibitively slow. This limitation made batch processing systems like 6 | MapReduce ill-suited to *iterative* (e.g. k-means clustering) and *interactive* 7 | (e.g. ad-hoc queries) workflows. Other systems like Pregel did take advantage 8 | of distributed memory and reused the in-memory data across computations, but 9 | the systems were not general-purpose. 10 | 11 | Spark uses **Resilient Distributed Datasets** (RDDs) to perform general 12 | computations in memory. 
RDDs are immutable partitioned collections of records. 13 | Unlike pure distributed shared memory abstractions which allow for arbitrary 14 | fine-grained writes, RDDs can only be constructed using coarse-grained 15 | transformations from on-disk data or other RDDs. This weaker abstraction can be 16 | implemented efficiently. Spark also uses RDD lineage to implement low-overhead 17 | fault tolerance. Rather than persist intermediate datasets, the lineage of an 18 | RDD can be persisted and efficiently recomputed. RDDs could also be 19 | checkpointed to avoid the recomputation of a long lineage graph. 20 | 21 | Spark has a Scala-integrated API and comes with a modified interactive 22 | interpreter. It also includes a large number of useful **transformations** 23 | (which construct RDDs) and **actions** (which derive data from RDDs). Users can 24 | also manually specify RDD persistence and partitioning to further improve 25 | performance. 26 | 27 | Spark subsumed a huge number of existing data processing frameworks like 28 | MapReduce and Pregel in a small amount of code. It was also much, much faster 29 | than everything else on a large number of applications. 30 | --------------------------------------------------------------------------------
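As a closing illustration of the transformation/action split and of lineage-based recovery, here is a hedged sketch using the PySpark API; the paper's own API is Scala-integrated, the method names here match later PySpark releases, and `logs.txt` plus the error-counting workload are invented for illustration.

```
# Minimal RDD sketch: transformations are lazy, actions force evaluation, and
# persisted RDDs are kept in memory for reuse across computations.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

lines  = sc.textFile("logs.txt")                     # RDD from on-disk data
errors = lines.filter(lambda l: "ERROR" in l)        # transformation (lazy)
errors.persist()                                     # keep this RDD in memory

print(errors.count())                                # action: forces evaluation
print(errors.map(lambda l: l.split()[0]).take(5))    # reuse the cached RDD

# If a partition of `errors` is lost, Spark recomputes it from the lineage
# (textFile -> filter) rather than restoring it from a checkpoint.
```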