├── README.md
├── assets
│   ├── cegis
│   │   ├── cegis.drawio
│   │   ├── cegis.png
│   │   ├── duel.drawio
│   │   └── duel.png
│   └── wcoj
│       ├── hypergraph.drawio
│       ├── hypergraph.png
│       ├── trie.drawio
│       └── trie.png
├── drafts
│   ├── database.md
│   ├── kernel-update.txt
│   └── relational-calculus.md
└── posts
    ├── cegis.md
    ├── datalog-resources.md
    ├── egraph-inter.md
    ├── email-cure.md
    ├── entropic-bounds.md
    ├── entropy.md
    ├── grad-ind.md
    ├── late-materialization.md
    ├── recursive-agm.md
    ├── sql-eq.md
    ├── ssh-image.md
    └── wcoj.md
/README.md:
--------------------------------------------------------------------------------
1 | Moved to [remy.wang/blog](https://remy.wang/blog/).
2 |
--------------------------------------------------------------------------------
/assets/cegis/cegis.drawio:
--------------------------------------------------------------------------------
1 | 5VnbctowEP0aHtux5QvOYwqUpmkybUnS5lHYa1uJbXlkEaBfXxlL+AqBKYR0eGG8Ryth7dlzpISeMYgXY4bT8IZ6EPWQ5i16xrCHkGVfiM8cWBZA35RAwIhXQHoJTMgfkKAm0RnxIKslckojTtI66NIkAZfXMMwYndfTfBrVvzXFAbSAiYujNvqLeDwsUMfSSvwLkCBU36xrciTGKlkCWYg9Oq9AxqhnDBilvHiKFwOI8tqpuhTzPm8YXb8Yg4TvMuE+mD/8uB89XNOfxErp0xTd3H3QDflyfKl2DJ4ogAwp4yENaIKjUYl+YnSWeJAvq4uozPlGaSrBJ+B8KdnEM04FFPI4kqOwIPx35flRPGsfLRkN897RVLCUgU8TLhfUHRG3ty8rktEZc2HLnvtFXr7PykRZtDHQGDhbigQGEebkpd4IWPZTsM5bT/1OiXgVpKnW70viZecbqmfUEhyzALicVRInHiqvUUIrOvegVm7zBUczuYUW1XUi5yHhMEnxqnhzoeY6aU0CfBJFAxpRtlrL8H0fua7AM87oM1RGPHtqW/Y2yl6AcVhsJUONOvWiIlvG81KchpJgWBWmo20msFb6fet8UgWVqnmsiqZTQaLqbFmZlIeP1bFy2irapDyIpisfy1kjwin/SYzK01da2JKHjiLaS8bwspKQ5mLMNmt63W6q/S4antvIN7Wt+eKheIODCt86RUPubM+vMv1G9qzrDXu2d7Pn9kLaKz5faOBoPq88tPT5ABJgmAv/Pabhexgcv9PwbdeBqX8Yw0fN6uptw9dRh+E36TyY36NWvd0Q3Gc4brV9x4Xu43XqWKalHabahvneqm2+n9NVjex3uuqnPF3Rjp6rv43nouaVGO1mle1z1X5loSPfre225zIcx7jtAUJ4vN5cdQEnNIGG2iWEIxIkInQF68JcjE+qIy7lQEw8b9XvXc5SamCvP572uImjJgVWyyrMDqdoMnUwp1BWVSElS8E9Y0Ys/cSMtI/KwWh8NTkfSozGJdMUC5+WErNFScpobl7nQ4rVvFK+oXMNb2dXN/rtXTa+nH19yobXBvQ7bvAuTjziYQ7/ASv9NSstCjqI2lkqyDZPy0rbvHwAb4rd5/Mlxbg4GikiLP8XXtzTyh8UjNFf
--------------------------------------------------------------------------------
/assets/cegis/cegis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/remysucre/blog/93d3213aed0712f3b70deaf613565427542e3b65/assets/cegis/cegis.png
--------------------------------------------------------------------------------
/assets/cegis/duel.drawio:
--------------------------------------------------------------------------------
1 | 5VnbcpswEP0aZtqHdAABJo+x46SdNp22TtLmUQYBSgRiZDnG/foKEOYi7DipXXvql4Q9uiDtObtaZA2M4uyawTS6oT4imqn7mQYuNdN0wLn4mwPLEhgAowRChv0SagAT/BtJUJfoHPto1urIKSUcp23Qo0mCPN7CIGN00e4WUNJ+awpDpAATDxIV/Yl9HpWoa+s1/hHhMKrebOiyJYZVZwnMIujTRQMCYw2MGKW8fIqzESK57yq/lOOu1rSuFsZQwrcZcBcu7r/fje8/0x/YTunj1Ly5PRuUszxDMpcblovly8oDjM4TH+WTGBoYLiLM0SSFXt66EJQLLOIxkc0BTbgk0XBzGxMyooSyYi4QBIHpeQKfcUafUKPFd6aO7YgWuSDEOMrW7tRY+U/oDtEYcbYUXaoBrnS51JzpSHtRMwgqnqIGe2YFQqmacDV37VjxIH37Cj8Dxa3IFzqTJmU8oiFNIBnX6LDt+LrPF0pTCT4izpfS33DOaZsMlGH+SzzrH2xpPeSWfL7MmsayMhKx3cag3HxottXDCqsa1yUekWkh9pxILMJpxWy+7828CjfROfPQBndWmQCyEPEN/cx+nTBEIMfP7XXsnHNDia0AJ/7Z5OJ2rzHmQ+QGvTHmeC6aBruJsVWoyBgDhhpjhtkTY86+QsxU3O1FyHs6u/uaexxc7TexuR7qT2xT17ZsfTdOB9axOd06nrxmvCmvGUeV18wt89oamWyd1+TQbxSLFdYxPejIy+zIply/HNUsNToTWc4LE5UbVCYqJLjaz9tV6SipIGQwjiFT1CpikbfF1Q7ghCaoE+0SggSHiTA9QTIS+LBSxIVsiLHvF3rvyyx1DPSJbBc1kNmhABhC+N1kYfUWQfs6D3WFllmKvBPmxHEPzol6ao7G158mp0PK6gyVpFhAPVP/LSWWQknKaJ7ATocUu1tdHkH2Ug8VbTDMCr+bo2Xxf3B5OhQpcaMfOm7Uu4zgXZZz815Y4mG4PF12TAccmB1XDR/l0yHxL/Iru9ytBM5m2LuNcLL5O6yu61u3FXWRv6auL3murvRA0e5fYVK9R9u6qH+xWB/0M9Vgwu5hosL+sqY3upm0+ym4bU3fLV6s7l3Znmt64/z/FNChhGF1LkmBfV4dsa+Vht393FOnerM4hFlfkZfd698ZwPgP
--------------------------------------------------------------------------------
/assets/cegis/duel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/remysucre/blog/93d3213aed0712f3b70deaf613565427542e3b65/assets/cegis/duel.png
--------------------------------------------------------------------------------
/assets/wcoj/hypergraph.drawio:
--------------------------------------------------------------------------------
1 | 3ZZbb4IwFIB/DY8mtMXb4xS3ZXEvw2R7baRCs8IhtQ70169IudmZuGTGRR9Mz9eeFr4eCg6ZJ8WTpFn8CiETDnbDwiG+g/HIm+j/EuwrgJA3qkgkeWhYCwJ+YAa6hu54yLa9gQpAKJ714RrSlK1Vj1EpIe8P24Dor5rRiFkgWFNh03ceqriik6Hb8mfGo7heGbmmJ6H1YAO2MQ0h7yCycMhcAqiqlRRzJkp5tZcq7/FMb3NhkqXqkgTfXy0//ODBm7kvs2hJMN6NB2Z7vqjYmRs2F6v2tQElOU2jMprlMVcsyOi67Mr1hmsWq0ToCOmmBEUVh1SHg6mrwYYLMQcBUpMU0nIKsx6TihVnbwQ1enRdMUiYkns9pE4YDasUU1PYNXHebtDYSI87ezMxjJqSiJqZW2u6YcT9QmJdqx2LhaVRz6Nr9gKLdJtVhbzhBQtPLDqYbI6/kkOqzAODpjreKgmf7Aq+J33faIos3+QH3+RqvpHle39Pvr1pv75H7o19Y8v34Y58Nyn/xjexfL/9pe+OV++c1ysd3cOToxvfWrVnqQ7uRPXJqd28/m6memipXt2Halw/sNevah22X4zHvs53N1l8Aw==
--------------------------------------------------------------------------------
/assets/wcoj/hypergraph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/remysucre/blog/93d3213aed0712f3b70deaf613565427542e3b65/assets/wcoj/hypergraph.png
--------------------------------------------------------------------------------
/assets/wcoj/trie.drawio:
--------------------------------------------------------------------------------
1 | 5VjLjtsgFP0aL1vZxs5jmTjpVJVGqpRFp6sKGWLTYhMRMrb79cVj8AOcSUadJK2yibgHfIFzDtfBDoiy8oHDXfrIEKaO76LSASvH9yfBTP7WQNUAnhdMGiThBCmsAzbkN1agq9ADQXg/GCgYo4LshmDM8hzHYoBBzlkxHLZldDjrDibYAjYxpDb6jSCRNugsdDv8MyZJqmf2XNWTQT1YAfsUIlb0ILB2QMQZE00rKyNMa/I0L81zn470tgvjOBfnPPAY0cU6+PEFg5/JOn7Y7lfF4YPK8gzpQW1YLVZUmgHODjnCdRLXAcsiJQJvdjCuewupucRSkVEZebKp0mEucHl0nV67e2kbzDIseCWH6AeAIkxbRnNddPzrIWmP+pnCoFI8aTN3pMiG4uUNHPkWRzZJcrtiyMRecPYLR4wyLpGc5XLkcksoNSBISZLLMJb0YIkva/KINOBCdWQEoXqaUeqH4mxZLtQR8qbvpIYfDtWY2GoEI2r4l1IDWGqA+1XDd2+sRvAP1o+Z4dgRjq5aP0KLI++OHDszqrl3Y8dOLDXCO1Zj5N16VTWm/0H9mJ9XP4JLcTS3OAru17E3f+Pp+vWaZXGOFvXdo2O5J4xJkiSCV0+KwZfgex18nIY6XpX93lWlomZajKwLjMGtvDFBnmBx6jVua9DjOBzhWGMcUyjI83AZY8SrGb4yIhfYSTzxhgduami3ZwceY/VU/4ZjJjJPrnkkGx6sRC8+aLf9F9Y44zb1NmuURLTOkO3GGKGKOlvUQdX3yJNKeBk7NXqcruonbXfk6F/HdkZhAeb/v3Nd15YanWj+Xq6TYfd5oBnefWQB6z8=
--------------------------------------------------------------------------------
/assets/wcoj/trie.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/remysucre/blog/93d3213aed0712f3b70deaf613565427542e3b65/assets/wcoj/trie.png
--------------------------------------------------------------------------------
/drafts/database.md:
--------------------------------------------------------------------------------
1 | # 7 Sketches of a Database
2 |
3 | Every scientific revolution results from a shift of perspective.
4 | Quantum mechanics unified the particle & wave perspective on matter;
5 | General relativity wove together the space & time fabrics of the universe;
6 | Cartesian coordinates
7 |
8 |
9 | ## The Physical Perspective
10 |
11 | ## The Logical Perspective
12 |
13 | ## The Algebraic Perspective
14 |
15 | ## The Combinatorial Perspective
16 |
17 | ## The Geometric Perspective
18 |
19 | ## The Information Theoretic Perspective
20 |
21 | ## Other Perspectives
22 |
--------------------------------------------------------------------------------
/drafts/kernel-update.txt:
--------------------------------------------------------------------------------
1 | mount /dev/nvme0n1p2 /boot
2 | grub-mkconfig -o /boot/grub/grub.cfg
3 | cd /boot && rm ...
4 |
--------------------------------------------------------------------------------
/drafts/relational-calculus.md:
--------------------------------------------------------------------------------
1 | # Relative Safety in Relational Calculus
2 |
3 | In classical model theory, there are two important questions:
4 |
5 | 1. **(Finite) Satisfiability**: Given a first-order sentence $\phi$,
6 | is there a (finite) structure $M$ such that $M \models \phi$?
7 |
8 | 2. **Model Checking**: Given a first-order sentence $\phi$ and
9 | a structure $M$, is it true that $M \models \phi$?
10 |
11 | In fact, the history of the first question goes back to the inception of computer science.
12 | In 1936, Alonzo Church and Alan Turing independently proved that Satisfiability for
13 | first-order logic is undecidable, resolving David Hilbert's Entscheidungsproblem.
14 | Their proofs also invented the two pillars supporting modern computer science:
15 | the lambda calculus and the Turing machine.
16 |
17 | A less well-known result is that *Finite* Satisfiability is also undecidable.
18 | Specifically, if a sentence $\phi$ contains a relation of arity at least 2,
19 | then it is undecidable to determine if $\phi$ has a finite model.
20 | This result is due to Boris Trakhtenbrot.
21 |
22 | On the other hand, the Model Checking problem is very easy.
23 | The data complexity (over the size of the model) of Model Checking over FOL
24 | (extended with $\times$ and $+$) is $AC^0$, one of the lowest
25 | complexity classes, well below PTIME.
26 | And even if we extend FOL with a least-fixpoint operator (i.e. Datalog),
27 | the problem is still in PTIME.
28 |
29 | But in databases, we care about a slightly different question from Satisfiability
30 | and Model Checking:
31 |
32 | 3. **Relational Calculus Query**: Fix a finite structure $M$,
33 | and a first-order sentence $\phi(\mathbf{x})$ over $M$,
34 | where $\mathbf{x}$ are the free variables of $\phi$.
35 | Let $Q$ be a relation symbol.
36 | Find a model for $Q$ such that $\forall \mathbf{x} : Q(\mathbf{x}) \Leftrightarrow \phi(\mathbf{x})$.
37 |
38 | Note that $Q$ may not always have a finite model. For example, if $\phi(x) := x = x$ then $Q$
39 | must contain all values in the domain, which can be infinite.
40 | We may therefore want to ask a question similar to Finite Satisfiability:
41 |
42 | 4. **Relative Safety**: Given a Relational Calculus query and a fixed database,
43 | does the query have a finite answer on that database?
44 | That is, is there a finite model for $Q$, defined as above?
45 |
46 | This problem sits in between the Finite Satisfiability problem and the Model Checking problem:
47 | it asks for a model for $Q$, but we already have the models for all other relations.
48 | Relative Safety was proposed by [AGSS86], which is written in Russian.
49 | Other papers cite it for a proof that Relative Safety is decidable,
50 | but I don't read Russian, so I had to do the proofs myself.
51 | Luckily, the proof is actually not very complicated.
52 |
53 | **Theorem** Consider Relational Calculus with the following syntax:
54 |
55 | ```math
56 | Q := \bot \mid \top \mid x = t \mid R(t_1, \ldots, t_k) \mid \neg Q \mid Q \vee Q \mid Q \wedge Q \mid \exists x .Q
57 | ```
58 |
59 | In short, it is the standard FOL where the only interpreted predicate is equality.
60 | Then Relative Safety is decidable.
61 |
62 | **Proof**
63 | Because the database contains only a finite set of finite relations,
64 | any infinite query output must contain some value that is not in the database.
65 | The converse is also true: if the query output contains *any* value not in the database,
66 | the output must be infinite.
67 | This is because the query cannot *distinguish* values not in the database,
68 | because the only predicates are relation symbols or =.
69 | Therefore, if we see one output tuple containing some $v$ not in the database,
70 | we can always change $v$ to another value not in the database to get another tuple.
71 |
72 | With the above insight, we can easily test if a query is infinite using a finite
73 | number of values not in the database.
74 | We can't use just one, because the sentence may contain a predicate $v = v'$ where
75 | both $v$ and $v'$ are free, in which case we may want to use two different values for them.
76 | The algorithm is as follows:
77 |
78 | Let $C$ be the (finite) set of values appearing in the database.
79 | Let $V = \{ v_1, \ldots, v_n \}$ be a set of $n$ values not in the database,
80 | where $n$ is the number of free variables of $Q$.
81 | For each tuple $t=(t_1, t_2, \ldots)$ where each $t_i \in C \cup V$,
82 | check if $\phi(t)$ is valid (model checking).
83 | If it is, and if $t$ contains a value not in $C$, then the query is infinite.
84 | If we exhaust all tuples ($C \cup V$ is finite) without finding such a $t$,
85 | the query is finite.
86 |
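The whole procedure fits in a few lines of Python. This is my own sketch, not code from [AGSS86]: it assumes the query is handed to us as a callable `phi` performing model checking of a tuple against the fixed database, and it mints fresh values guaranteed to lie outside the active domain.

```python
from itertools import product

def relatively_safe(phi, active_domain, n_vars):
    """Decide whether the query phi has a finite answer on the fixed database.

    phi: a callable t -> bool doing model checking of phi(t) over the database.
    active_domain: the set C of values appearing in the database.
    n_vars: the number of free variables of phi.
    """
    fresh = [("fresh", i) for i in range(n_vars)]  # n values not in C
    universe = list(active_domain) + fresh
    for t in product(universe, repeat=n_vars):
        # A satisfying tuple containing a value outside C witnesses infinitely
        # many answers, since values outside C are interchangeable.
        if phi(t) and any(v in fresh for v in t):
            return False
    return True  # every satisfying tuple stays inside the active domain

# phi(x) := x = x is satisfied by every value, so its answer is infinite.
assert not relatively_safe(lambda t: True, {1, 2}, 1)
# phi(x) := R(x) with R = {1, 2} has a finite answer.
assert relatively_safe(lambda t: t[0] in {1, 2}, {1, 2}, 1)
```

Since $C \cup V$ is finite, the loop always terminates, mirroring the argument above.
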
87 | [AGSS86] Alfred K. Ailamazyan, Mikhail M. Gilula, Alexei P. Stolboushkin, and Grigorii F. Schwartz. Reduction of a relational model with infinite domains to the case of finite domains. Doklady Akademii Nauk SSSR, 286(2):308–311, 1986. URL: http://mi.mathnet.ru/dan47310.
--------------------------------------------------------------------------------
/posts/cegis.md:
--------------------------------------------------------------------------------
1 | # Counterexample-guided Inductive Synthesis
2 |
3 | CEGIS takes a grammar and a specification,
4 | and outputs a program defined in the grammar satisfying the spec.
5 |
6 | Formally, it solves formulas of the form
7 |
8 | ```math
9 | \exists f . \forall x, y, \dots . \phi
10 | ```
11 |
12 | where $\phi$ is the specification and $f(x,y,\dots)$ is a program
13 | drawn from the given grammar.
14 |
15 | For example, we can ask for a function satisfying the following spec:
16 |
17 | ```math
18 | \exists f . \forall x, y . f(x,y) \geq x \wedge f(x,y) \geq y \wedge (x = f(x,y) \vee y = f(x,y))
19 | ```
20 | In English, $f(x,y)$ must be no less than both x and y,
21 | and it should be equal to either x or y
22 | (hint: it's the function `max`).
23 |
24 | The essence of CEGIS is illustrated by this picture:
25 |
26 | 
27 |
28 | The generator generates candidate programs drawn from the grammar,
29 | and the checker checks the candidates against the spec for correctness.
30 | If a candidate satisfies the spec,
31 | CEGIS outputs it as the solution.
32 | Otherwise the checker asks the generator for more candidates,
33 | possibly providing some feedback to the generator.
34 |
35 | A naive instantiation of this process
36 | uses exhaustive enumeration as the generator,
37 | and random testing as the checker.
38 | The generator simply proposes programs from small to large
39 | by following some BNF grammar,
40 | and the checker runs the candidate on random inputs
41 | and tests if the spec is satisfied.
42 |
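Concretely, the naive loop looks something like this sketch of mine (the two-element candidate pool stands in for a real grammar enumerator; `spec` encodes the max spec from above):

```python
import random

def spec(f, x, y):
    # The max spec: f(x,y) is at least x and y, and equal to one of them.
    fx = f(x, y)
    return fx >= x and fx >= y and (fx == x or fx == y)

def random_check(f, trials=200):
    # Unsound checker: passing random tests does not prove correctness.
    return all(spec(f, random.randint(-10, 10), random.randint(-10, 10))
               for _ in range(trials))

def naive_cegis(candidates):
    # Generator: plain exhaustive enumeration, using no feedback at all.
    for f in candidates:
        if random_check(f):
            return f
    return None

f = naive_cegis([lambda x, y: x + y, lambda x, y: max(x, y)])
```

With overwhelming probability the random checker rejects `x + y` and the loop settles on `max`, but nothing rules out an unlucky run that accepts a wrong candidate.
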
43 | Another instantiation of this process
44 | uses human software engineers as the generator,
45 | and human testing engineers as the checker.
46 | But let's focus on machines and ignore humans from now on.
47 |
48 | The naive instantiation has some problems:
49 | 1. the checker is unsound, because random testing is not exhaustive.
50 | 2. the generator running exhaustive enumeration is extremely inefficient.
51 | 3. if a test fails on the checker's side,
52 | the generator doesn't know *why* it failed.
53 | That is, the feedback from the checker is not very informative.
54 |
55 | To fix the first shortcoming, we can replace random testing with
56 | automated verification.
57 | Specifically, we use an SMT solver,
58 | which either verifies the correctness of the candidate
59 | or returns a *counterexample*.
60 | The counterexample is an input to the candidate program,
61 | such that running the program on it produces a result that violates the spec.
62 | Using an SMT-solver, our checker guarantees the correctness of
63 | any candidate that passes the check.
64 | It also provides the counterexample as feedback that explains *why*
65 | a failing candidate is incorrect.
66 |
67 | To be more concrete, let's take our spec for *max* above,
68 | and use the following simple grammar:
69 |
70 | ```
71 | e := x + y | max(x,y)
72 | ```
73 |
74 | Suppose the generator proposes $\max(x,y)$ as a candidate;
75 | the checker then instantiates the spec with $f=\max$:
76 |
77 | ```math
78 | \forall x, y . \max(x,y) \geq x \wedge \max(x,y) \geq y \wedge (x = \max(x,y) \vee y = \max(x,y))
79 | ```
80 |
81 | To check the validity of this formula, the checker simply drops the $\forall$
82 | and asks the solver if the *negation* of the body is satisfiable:
83 |
84 | ```math
85 | \text{SAT?} \Big( \neg \big( \max(x,y) \geq x \wedge \max(x,y) \geq y \wedge (x = \max(x,y) \vee y = \max(x,y)) \big) \Big)
86 | ```
87 |
88 | This is unsatisfiable, meaning *no* values of x and y can make the negated spec true.
89 | That is the same as saying no x and y can make the original spec false.
90 | Therefore the candidate is correct.
91 |
92 | If the generator proposes x+y as a candidate,
93 | the checker will instantiate the spec with f=+
94 | and check for the following formula:
95 |
96 | ```math
97 | \text{SAT?} \Big( \neg \big( x+y \geq x \wedge x+y \geq y \wedge (x = x+y \vee y = x+y) \big) \Big)
98 | ```
99 |
100 | Clearly this is satisfiable: when x=y=1, x+y=2 equals neither x nor y.
101 | The checker therefore rejects the candidate
102 | and provides {x=1, y=1} as the counterexample.
103 |
104 | The next step is to improve the generator,
105 | so that it can take advantage of the feedback
106 | when it proposes the next candidate.
107 | Remarkably, this too can be implemented with an SMT-solver!
108 | The idea is, we want to find a program that satisfies the spec
109 | *on all counterexamples encountered so far*.
110 | This program also needs to be drawn from the grammar.
111 |
112 | The important insight is that **the grammar can be expressed as an SMT formula**!
113 | To do so, we simply introduce a boolean "switch" for every choice in the grammar.
114 |
115 | For example, our grammar above becomes:
116 |
117 | ```
118 | if b then f = x+y else f = max(x,y)
119 | ```
120 |
121 | In first order logic, this is
122 |
123 | ```math
124 | (b \Rightarrow f(x,y) = x + y) \wedge (\neg b \Rightarrow f(x,y) = max(x,y))
125 | ```
126 |
127 | With this encoding, we just need to decide the right value for b such that
128 | the spec is satisfied:
129 |
130 | ```math
131 | (b \Rightarrow f = +) \wedge (\neg b \Rightarrow f = \max) \wedge
132 | \forall x, y . f(x,y) \geq x \wedge f(x,y) \geq y \wedge (x = f(x,y) \vee y = f(x,y))
133 | ```
134 |
135 | There is one last wrinkle: we need to remove the $\forall$ quantification,
136 | since quantified formulas are usually undecidable.
137 | So instead of solving for all x and y, we instantiate the formula with the
138 | counterexamples:
139 |
140 | ```math
141 | (b \Rightarrow f = +) \wedge (\neg b \Rightarrow f = \max) \wedge
142 | f(1,1) \geq 1 \wedge f(1,1) \geq 1 \wedge (1 = f(1,1) \vee 1 = f(1,1))
143 | ```
144 | After "inlining" f, this becomes:
145 |
146 | ```math
147 | b \Rightarrow 1+1 \geq 1 \wedge 1+1 \geq 1 \wedge (1 = 1+1 \vee 1 = 1+1)
148 | ```
149 | ```math
150 | \wedge \neg b \Rightarrow \max(1,1) \geq 1 \wedge \max(1,1) \geq 1 \wedge (1 = \max(1,1) \vee 1 = \max(1,1))
151 | ```
152 |
153 | Now it's very easy to see that b must be false to make the above true,
154 | and an SMT-solver would return exactly that.
155 | We encountered only one counterexample pair {x=1, y=1};
156 | had we found more, the generator would instantiate the formula for each of them.
157 | For example, if we also had {x=2, y=2} as a CE, the instantiation would be:
158 |
159 | ```math
160 | b \Rightarrow 2+2 \geq 2 \wedge 2+2 \geq 2 \wedge (2 = 2+2 \vee 2 = 2+2)
161 | ```
162 | ```math
163 | \wedge \neg b \Rightarrow \max(2,2) \geq 2 \wedge \max(2,2) \geq 2 \wedge (2 = \max(2,2) \vee 2 = \max(2,2))
164 | ```
165 | ```math
166 | \wedge b \Rightarrow 1+1 \geq 1 \wedge 1+1 \geq 1 \wedge (1 = 1+1 \vee 1 = 1+1)
167 | ```
168 | ```math
169 | \wedge \neg b \Rightarrow \max(1,1) \geq 1 \wedge \max(1,1) \geq 1 \wedge (1 = \max(1,1) \vee 1 = \max(1,1))
170 | ```
171 |
172 | This encoding can easily be extended to recursive grammars,
173 | if we bound the depth of the grammar and gradually increase this bound.
174 | The reader can take it as a challenge to figure out exactly how to achieve this.
175 |
176 | Stepping back, we see that the generator and checker of CEGIS are implemented by a pair of **dueling solvers**,
177 | as shown in the picture below.
178 |
179 | 
180 |
181 | On the checker's side:
182 | 1. the spec is instantiated with f replaced by the candidate program,
183 | 2. the solver checks the *validity* of the instantiation:
184 | it outputs the candidate when valid, and gives a counterexample otherwise.
185 |
186 | On the generator's side:
187 | 1. the spec is instantiated with f drawn from the grammar,
188 | as well as x,y bound to the CEs.
189 | 2. the solver tries to *satisfy* the instantiation:
190 | it proposes a candidate if the formula is satisfiable.
191 | Otherwise, the synthesis problem has no solution.
192 |
193 | CEGIS is effective
194 | because the generator improves its proposals with every round of feedback.
195 | Because the feedback from the checker accumulates more and more counterexamples,
196 | the generator will propose programs that are correct on more and more inputs.
197 | In a sense, the candidate gets "more and more correct" every time the loop goes around.
198 | And when the checker finally OK's the candidate, it is guaranteed to be sound.
199 | Compare this to exhaustive enumeration: every time the loop comes around,
200 | there's no guarantee that the new candidate will be "more correct".
201 | Even though enumerating the next expression is much faster than SMT-solving,
202 | exhaustive enumeration would still incur exponentially many calls to the checker.
203 |
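For completeness, here is a self-contained sketch of the whole loop. To stay runnable without an SMT solver, both sides are brute-force searches over a small integer domain (a real implementation would call, e.g., Z3); the boolean switch `b` selects a grammar production exactly as in the encoding above:

```python
from itertools import product

# The two productions of our toy grammar, selected by the switch b.
PRODUCTIONS = {True: lambda x, y: x + y, False: lambda x, y: max(x, y)}

def spec(f, x, y):
    # The max spec from the beginning of the post.
    fx = f(x, y)
    return fx >= x and fx >= y and (fx == x or fx == y)

def check(f, domain):
    # Checker: exhaustively search the domain for a counterexample.
    for x, y in product(domain, repeat=2):
        if not spec(f, x, y):
            return (x, y)
    return None  # valid on the whole domain

def generate(counterexamples):
    # Generator: pick a switch value b whose program satisfies the spec
    # on all counterexamples seen so far.
    for b in (True, False):
        f = PRODUCTIONS[b]
        if all(spec(f, x, y) for x, y in counterexamples):
            return f
    return None  # grammar exhausted: no solution exists

def cegis(domain):
    counterexamples = []
    while True:
        f = generate(counterexamples)
        if f is None:
            return None
        cex = check(f, domain)
        if cex is None:
            return f
        counterexamples.append(cex)

winner = cegis(range(-5, 6))
```

On the max spec the first candidate `x + y` is rejected with a counterexample, and the next round proposes `max`, which the checker accepts.
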
204 | ## Acknowledgement
205 |
206 | I first learned of CEGIS from [Emina Torlak](https://homes.cs.washington.edu/~emina/)'s
207 | [course](https://courses.cs.washington.edu/courses/cse507/) at UW.
208 | Since then I have kept going back to
209 | [these](https://www.cs.utexas.edu/~bornholt/post/synthesis-explained.html)
210 | [posts](https://www.cs.utexas.edu/~bornholt/post/building-synthesizer.html)
211 | by [James Bornholt](https://www.cs.utexas.edu/~bornholt/)
212 | for reference.
213 | But the posts don't explain how the solver works in the generator,
214 | and that's why I wrote this post.
215 |
216 | I highly recommend [Rosette](https://emina.github.io/rosette/)
217 | if you want to get your hands dirty synthesizing some programs -
218 | it's a rock-solid language that is an absolute joy to use.
219 |
220 | ## References
221 |
222 | Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, and Vijay Saraswat. 2006. Combinatorial sketching for finite programs. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS XII). Association for Computing Machinery, New York, NY, USA, 404–415. DOI:https://doi.org/10.1145/1168857.1168907
223 |
--------------------------------------------------------------------------------
/posts/datalog-resources.md:
--------------------------------------------------------------------------------
1 | # Datalog Resources
2 |
3 | I love it when people ask me about Datalog.
4 | I also find myself pointing them to some very good resources.
5 | To save the trouble of searching through email threads every time,
6 | I'm dumping these links here for future reference.
7 |
8 | My favorite tutorial is
9 | [Introduction to Datalog](https://x775.net/2019/03/18/Introduction-to-Datalog.html)
10 | by x775.
11 | It has a lot of examples conveying intuition, but is also very rigorous.
12 | For a more interactive experience, https://percival.ink is a Datalog notebook running
13 | in your browser.
14 | If you want to be even more rigorous, the [Alice book](http://webdam.inria.fr/Alice/)
15 | is **the** definitive textbook on the foundations of databases.
16 |
17 | When you're ready to write real Datalog programs,
18 | check out [souffle](https://github.com/souffle-lang/souffle)
19 | which is a high quality, blazing fast implementation used
20 | in industry and research.
21 | Another good language is [flix](https://flix.dev) which started
22 | as an extension of Datalog, but has become much more than that.
23 |
24 | If you have your favorite Datalog resource and don't see it here,
25 | feel free to suggest it by opening a PR!
--------------------------------------------------------------------------------
/posts/egraph-inter.md:
--------------------------------------------------------------------------------
1 | # E-Graph Intersection
2 |
3 | A Rust implementation of the algorithm based on
4 | [egg](https://egraphs-good.github.io/)
5 | can be found [here](https://github.com/remysucre/eggegg).
6 |
7 | **Definition** [e-graph intersection]
8 | An e-graph represents a (potentially infinite) set of terms and
9 | an equivalence relation over the terms.
10 | The intersection of two e-graphs G1, G2
11 | is an e-graph G that satisfies the following:
12 | 1. the set of terms it represents
13 | is the intersection of the sets of terms represented by G1, G2, and
14 | 2. two terms are equal (according to the equivalence relation) in G
15 | if and only if they are equal in both G1 and G2.
16 |
17 | To be even more formal, define $G = (T, E)$
18 | where T is the set of terms and E the equivalence relation,
19 | and similarly $G_1 = (T_1, E_1), G_2 = (T_2, E_2)$.
20 | Then $T = T_1 \cap T_2$ and
21 | $(t_1, t_2) \in E \Leftrightarrow (t_1, t_2) \in E_1 \wedge (t_1, t_2) \in E_2$.
22 |
23 | **Algorithm** [intersecting e-graphs]
24 | Given two e-graphs G1, G2, compute their intersection G.
25 |
26 | Observe that G’s equivalence relation E must be a refinement of E1 and E2,
27 | because if $(t_1, t_2) \in E$ we must have $(t_1, t_2) \in E_1$
28 | and $(t_1, t_2) \in E_2$,
29 | but if $(t_a, t_b) \in E_1$ we might not have $(t_a, t_b) \in E$,
30 | and similarly for E2.
31 | Given this observation, we can define the following maps:
32 |
33 | - $E \rightarrow E_2$ will map each class in E to a corresponding class in E2,
34 | - $E_1 \rightarrow \{E\}$ will map each class in E1 to a (possibly empty) set of classes in E.
35 |
36 | A class in E1 may map to the empty set
37 | if the terms it contains are not represented in G.
38 | The main purpose of these maps is to relate eclasses across G, G1 and G2,
39 | so that we can for example check if two enodes in G are equal by asking G1 and G2.
40 | We initialize all maps to be empty at the beginning.
41 | The intersection algorithm proceeds as follows:
42 |
43 | ```
44 | while G changes:
45 |   for class in G1.classes:
46 |     for op(c1,...,cn) in class.nodes:
47 |       // map child classes to classes in G
48 |       for op(c1',...,cn') where c1'∈E1→E[c1],...,cn'∈E1→E[cn]
49 |         // only add the node if it's also in G2
50 |         if let Some(c2) = G2.get(op(E→E2[c1'],...,E→E2[cn']))
51 |           cnew = G.add(op(c1',...,cn'))
52 |           E→E2[cnew] = c2
53 |           E1→E[class].insert(cnew)
54 |           // this loop is expensive but seems necessary, because the
55 |           // new node may only be equiv. to some nodes & not others
56 |           for c in E1→E[class]
57 |             if G2.find(E→E2[c]) = G2.find(c2): G.union(cnew,c)
58 | ```
59 |
60 | In English, we repeatedly scan the nodes in G1 and try to add them to G.
61 | Nodes will get added bottom-up starting with the leaves
62 | (we can’t add a node without adding its children first),
63 | and a node only gets added if it’s in G2 as well.
64 | But one node in G1 may become multiple nodes in G,
65 | because any of its child classes may have been split into multiple classes in G.
66 | So we follow the map $E_1 \rightarrow \{E\}$ and create the new nodes accordingly.
67 | After adding a new node to G, we update the maps.
68 | Because all nodes we added from the same class in G1
69 | ($E_1 \rightarrow \{E\}[\text{class}]$)
70 | are equivalent according to G1,
71 | we only need to check if they are also equivalent under G2.
72 | If any pair are equivalent, we union them in G.
73 | We repeat this process until G stops changing, much like equality saturation.
74 | In fact, we can roughly think of the process as an equality saturation
75 | where the rewrite rules are given by G1 and G2.
76 |
77 | **Correctness Theorem**
78 | When the algorithm terminates (it always does),
79 | G is the intersection of G1 and G2.
80 |
81 | *Termination*: we first prove G must be finite.
82 | Assume the contrary, i.e. G has finitely many classes but infinitely many enodes,
83 | or infinitely many classes.
84 | The first case is impossible,
85 | because with finite function symbols and finite classes
86 | we can only construct a finite number of enodes
87 | (multiple copies of identical enodes are hash-cons’ed).
88 | If there are infinitely many classes in G,
89 | then infinitely many of them must correspond to the same class in G1.
90 | But the fact that they are not unioned in G means
91 | they must each correspond to a different class in G2,
92 | which is impossible because the latter is finite.
93 |
94 | Now it is easy to show the algorithm terminates.
95 | Because union always reduces the number of classes,
96 | the only way for G to keep changing is to keep adding new enodes.
97 | But this will result in an infinite number of enodes, which is impossible.
98 |
99 | *Correctness*: the correctness of the algorithm consists of the following:
100 | - *Representation soundness*: every term represented in G is represented in G1 and G2,
101 | - *Representation completeness*: if a term is represented in both G1 and G2, it is in G,
102 | - *Equality soundness*: every two terms equal in G are equal in both G1 and G2,
103 | - *Equality completeness*: if two terms are equal in both G1, G2, they are equal in G.
104 |
105 | The following needs work -
106 | these rely on some invariants about the maps $E \rightarrow E_2$ and $E_1 \rightarrow \{E\}$:
107 | - If $c_2 = E \rightarrow E_2[c]$, then the terms represented by c are all in c2.
108 | - If $c \in E_1 \rightarrow \{E\}[c_1]$, then the terms represented by c1 are all in c.
109 |
110 | *Representation soundness*: every enode added to G is constructed from an enode in G1,
111 | and we only add it if it’s also found in G2.
112 |
113 | *Equality Soundness*: union’s arguments come from the same eclass in G1,
114 | and we only union if they also correspond to the same eclass in G2.
115 |
116 | Completeness should be proved by induction on the terms.
117 |
118 | **Acknowledgement**
119 | The idea of intersecting e-graphs is due to
120 | [Altan](https://altanh.com/) and [Josh](https://joshmpollock.com/);
121 | they [implemented it](https://github.com/uwplse/unscramble) before I did.
122 | I wrote my implementation from scratch,
123 | but I suspect it is equivalent to theirs.
124 | Altan and Josh were inspired by the [work](https://doi.org/10.1145/3133886)
125 | using tree automata for synthesis.
126 | The paper below by Gulwani, Tiwari \& Necula is also relevant;
127 | I suspect it achieves precisely e-graph intersection.
128 |
129 | **References**
130 |
131 | Burghardt, Jochen. "E-generalization using grammars." Artificial intelligence 165.1 (2005): 1-35.
132 |
133 | Gulwani, Sumit, Ashish Tiwari, and George C. Necula. "Join algorithms for the theory of uninterpreted functions." International Conference on Foundations of Software Technology and Theoretical Computer Science. Springer, Berlin, Heidelberg, 2004.
134 |
135 | Xinyu Wang, Isil Dillig, and Rishabh Singh. 2017. Synthesis of data completion scripts using finite tree automata. Proc. ACM Program. Lang. 1, OOPSLA, Article 62 (October 2017), 26 pages. DOI:https://doi.org/10.1145/3133886
136 |
--------------------------------------------------------------------------------
/posts/email-cure.md:
--------------------------------------------------------------------------------
1 | # Filter Every Mail: a Cure for Email Addiction
2 |
3 | Years ago I purged social media from my life,
4 | and I never looked back.
5 | I could not afford to quit email though,
6 | so I turned off notifications,
7 | only to find myself compulsively checking
8 | my inbox every 15 minutes.
9 | This week, I finally found a simple cure
10 | for my email addiction:
11 | filter every mail.
12 |
13 | It's as easy as it sounds:
14 | I set up a filter to have every single mail
15 | skip the inbox, so my inbox is always empty.
16 | Inbox zero by construction!
17 | One way to do this in Gmail is to create a filter
18 | for any message larger than 0MB.
19 |
20 | Some benefits of this method:
21 | 1. It needs no plugins.
22 | 2. It works on the web, email clients, and mobile apps.
23 | 3. When I need to find information from past emails, I can still log in without seeing new messages.
24 |
25 | The third point is very important:
26 | before this I had tried logging out,
27 | using blockers,
28 | and turning off sync from my email client,
29 | but sometimes I just need to look up a link
30 | that someone sent me either yesterday or a minute ago.
31 | Now I can do my searching without getting
32 | jabbed by dopamine.
33 |
34 | I still need to actually check for new mail though,
35 | but I have enough self-control to check "all mail"
36 | only once every couple of days.
37 |
--------------------------------------------------------------------------------
/posts/entropic-bounds.md:
--------------------------------------------------------------------------------
1 | # Fundamental Entropic Bounds
2 |
3 | When my advisor Dan Suciu taught me entropy, he said everyone should know 3
4 | inequalities: the "entropy range", monotonicity, and submodularity. Luckily I
5 | don't have to memorize the bounds as each inequality has a very simple intuition.
6 | First, the "entropy range" simply bounds the value of any entropy function:
7 |
8 | ```math
9 | 0 \leq H(X) \leq \log(|X|)
10 | ```
11 |
12 | On one hand, the entropy is 0 if $X$ takes a certain value $A$ with probability
13 | $1$. In that case, we know $X$'s value ($A$) without communicating a single bit.
14 | On the other hand, we can always simply use full binary encoding with
15 | $\log(|X|)$ bits to encode $X$, ignoring the probability distribution.
16 |
17 | **Monotonicity** says that the entropy of a string of random variables is no less than
18 | the entropy of any substring:
19 |
20 | ```math
21 | H(X) \leq H(XY)
22 | ```
23 |
24 | Here $XY$ simply "concatenates" $X$ and $Y$, in that a value of $XY$ concatenates
25 | a value of $X$ with a value of $Y$. The entropy $H(XY)$ is the number of bits
26 | necessary to transmit a string in $XY$. With this in mind, monotonicity simply says
27 | that transmitting more information requires more bits.
28 |
29 | Finally, our last inequality, submodularity, conveys the intuition of
30 | "diminishing returns": 10 dollars matter less to a millionaire than to a PhD
31 | student. More concretely, suppose we have a function $f$ from wealth to
32 | quality of life. Then $f$ is submodular because $f(x + \delta) - f(x)$ gets
33 | smaller and smaller as $x$ increases. In the context of information theory, $H$
34 | is submodular because adding additional information to a long message takes
35 | little effort. For example, suppose a submarine needs to send reports describing
36 | the fish it finds, and the description includes weight, length and species. Then
37 | if it says the fish is 80 feet long, you'll know it's a blue whale without
38 | looking at the species field. In general, we can save some bits by inferring
39 | facts from the correlation of data; if all variables are independent we can save
40 | nothing. With this intuition, let's look at the formal statement of
41 | **submodularity**:
42 |
43 | ```math
44 | H(X) + H(Y) \geq H(X \cup Y) + H(X \cap Y)
45 | ```
46 |
47 | Rearranging, we get $H(X \cup Y) - H(X) \leq H(Y) - H(X \cap Y)$. Note $(X
48 | \cup Y) - X = Y - (X\cap Y)$, and if we define $\delta = Y - (X\cap Y)$, the
49 | inequality becomes $H(X + \delta) - H(X) \leq H((X\cap Y) + \delta) - H(X\cap
50 | Y)$, which states precisely "the law of diminishing returns" because $X \supseteq
51 | (X\cap Y)$!
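
As a sanity check, all three inequalities can be verified numerically. The sketch below uses a small joint distribution made up for illustration, with entropies computed over marginals:

```python
import math

# A toy joint distribution over two variables X and Y, made up for
# illustration; outcomes are pairs (value of X, value of Y).
joint = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.1, ('b', 1): 0.4}

def H(positions):
    """Entropy (in bits) of the marginal over the given variable positions."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in positions)
        marginal[key] = marginal.get(key, 0.0) + p
    return sum(p * math.log2(1 / p) for p in marginal.values() if p > 0)

HX, HY, HXY = H([0]), H([1]), H([0, 1])
assert 0 <= HX <= math.log2(2)       # entropy range: X has 2 outcomes
assert HX <= HXY and HY <= HXY       # monotonicity
assert HX + HY >= HXY + H([])        # submodularity; here X and Y share no variable
```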
52 |
--------------------------------------------------------------------------------
/posts/entropy.md:
--------------------------------------------------------------------------------
1 | # Entropy
2 |
3 | Entropy can be understood as the minimum amount of data needed to be transmitted
4 | in order to communicate a piece of information. Concretely, this is the average
5 | number of bits needed to encode a message. For example, imagine a spaceship
6 | sends a status code every minute to indicate if it has found any alien
7 | civilization. The code is any letter from the English alphabet, with $A$ meaning
8 | "nothing new", and some other letter describing the alien. For example $F$ means
9 | the alien is friendly, and $B$ means the alien is blue, etc. For simplicity
10 | assume every code is only 1-letter long. Then we can simply encode this status
11 | code with $\log(26)$ bits, where $A = 0 \dots 0, B = 0\dots 1, \dots$. However,
12 | we can be a little clever because we know most of the time the spaceship won't
13 | be finding new civilizations. In that case, the spaceship can remain silent to
14 | indicate "nothing new"; otherwise it sends the status code with our original
15 | encoding [^1].
16 |
17 | Then we only need to send on average a little more than 0 bits per minute. In
18 | general, we can save some bits if we know certain messages occur with high/low
19 | probability. In other words, the minimum communication cost depends on the
20 | probability distribution of the data. Entropy precisely formalizes this
21 | intuition. Formally, if $X$ is a random variable with outcomes $x_1, \dots, x_N$,
22 | occurring with probabilities $p_1, \dots, p_N$, then its **entropy** is defined as:
23 |
24 | ```math
25 | H(X) = \sum_i p_i \log \frac{1}{p_i}
26 | ```
27 |
28 | This matches our intuition: when $X$ is uniform and $|X| = N$, $H(X)=N(1/N
29 | \log N)=\log N$; when $X$ is almost always a certain message, say $A$, then
30 | $H(X)= p_A \log \frac{1}{p_A} + \sum_{i \not = A} p_i \log \frac{1}{p_i} =
31 | 0.99999 \log \frac{1}{0.99999} + \delta \approx 0$. For a more general case,
32 | suppose message $A$ occurs half of the time, $B$ one quarter of the time,
33 | $C$ one eighth and so on. Then we can use one bit $1$ to indicate that $A$
34 | occurs; otherwise we first send one bit $0$ to indicate it's not $A$, then
35 | send one bit $1$ if $B$ occurs and $0$ otherwise, and so on. On average we
36 | need $p_A \times 1 + p_B \times 2 + \dots = p_A \times \log(\frac{1}{p_A}) +
37 | p_B \times \log(\frac{1}{p_B}) + \dots = H(X)$ bits.
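
The worked example above can be checked numerically; here is a minimal Python sketch (the distribution and the prefix-code lengths are the ones from the example):

```python
import math

def entropy(ps):
    """H(X) = sum_i p_i * log2(1 / p_i), in bits."""
    return sum(p * math.log2(1 / p) for p in ps if p > 0)

# uniform over the 26 status codes: log2(26) bits, about 4.7
assert abs(entropy([1 / 26] * 26) - math.log2(26)) < 1e-9

# A half the time, B a quarter, C and D an eighth each; the prefix
# code (1, 01, 001, 000) uses exactly H(X) = 1.75 bits on average
ps = [1 / 2, 1 / 4, 1 / 8, 1 / 8]
avg_bits = sum(p * l for p, l in zip(ps, [1, 2, 3, 3]))
assert abs(entropy(ps) - avg_bits) < 1e-9
```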
38 |
39 | [^1]: This works fine because we assume the spaceship is not dead. Otherwise we can
40 | have the spaceship send a single bit $0$ every minute to signal it's alive; when it
41 | finds an alien, we could also prefix the status code with $1$ so that the message
42 | doesn't get mangled with the alive-bits.
43 |
--------------------------------------------------------------------------------
/posts/grad-ind.md:
--------------------------------------------------------------------------------
1 | # A Canonicity Proof via Gradient Induction
2 |
3 | A canonical form for a language defines a representative among an equivalence class of terms.
4 | It can help identify equivalent terms in the language.
5 | Here I present a proof for the canonicity of the sum-over-product normal form for arithmetic,
6 | to demonstrate an interesting technique that I call induction over derivatives.
7 | A catchier name I thought of is gradient induction.
8 |
9 | First let's clarify what we mean by canonical form:
10 | when two expressions in a language, considered as programs,
11 | evaluate to the same result on all possible inputs,
12 | we say they are semantically equivalent.
13 | We therefore hope to find a "standard way" to write such expressions,
14 | so that when we rewrite any two expressions to the standard form,
15 | we can immediately tell if they are semantically equivalent by just looking at them.
16 | Such a standard form is considered canonical:
17 | we say that two terms, e1 and e2,
18 | share the same canonical form if and only if they are semantically equivalent.
20 |
21 | Formally:
22 |
23 | ```math
24 | \text{canonicalize}(e_1) \equiv \text{canonicalize}(e_2) \Leftrightarrow \forall x . \text{eval}(e_1, x) = \text{eval}(e_2, x)
25 | ```
26 |
27 | Here $\equiv$ denotes syntactic equality,
28 | and = denotes semantic (value) equality.
29 | In our case, the expressions are in the language of arithmetics $(+, \times, x, \mathbb{R})$:
30 |
31 | **Definition** an *arithmetic expression* is either a variable,
32 | a constant, the sum of two expressions, or the product of two expressions.
33 |
34 | And our normal form is essentially the standard polynomials:
35 |
36 | **Definition** the sum-over-product normal form of an arithmetic expression
37 | is the sum of products of literals,
38 | where a literal is either a variable or a constant.
39 | Furthermore, we combine monomials that only differ in their coefficients,
40 | e.g. 2xy + 3xy is rewritten into 5xy.
41 |
42 | One can rewrite an expression to SoP with the standard laws
43 | (associativity, commutativity & distributivity).
44 | That is, we keep pulling + up and merge terms.
45 | For example, the SoP canonical form of (x + z)(x + y) is x^2 + xy + xz + yz.
46 |
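As an illustration (not part of the proof), here is a minimal Python sketch of the rewriting: expressions are nested tuples, and canonicalization produces a map from monomials to coefficients, merging monomials that differ only in their coefficients.

```python
from collections import Counter

# An expression is a variable (str), a constant (int/float),
# ('+', e1, e2), or ('*', e1, e2).
def canonicalize(e):
    """Sum-over-product form: a dict mapping each monomial
    (a sorted tuple of variables) to its coefficient."""
    if isinstance(e, (int, float)):
        return {(): e}
    if isinstance(e, str):
        return {(e,): 1}
    op, a, b = e
    ca, cb = canonicalize(a), canonicalize(b)
    out = Counter()
    if op == '+':
        for m, c in ca.items(): out[m] += c
        for m, c in cb.items(): out[m] += c
    else:  # '*': distribute, merging monomials that differ only in order
        for m1, c1 in ca.items():
            for m2, c2 in cb.items():
                out[tuple(sorted(m1 + m2))] += c1 * c2
    return {m: c for m, c in out.items() if c != 0}

# (x + z)(x + y) canonicalizes to x^2 + xy + xz + yz
assert canonicalize(('*', ('+', 'x', 'z'), ('+', 'x', 'y'))) == \
    {('x', 'x'): 1, ('x', 'y'): 1, ('x', 'z'): 1, ('y', 'z'): 1}
```
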
47 | **Proposition** the sum-over-product normal form is canonical:
48 |
49 | ```math
50 | C_{sop}(e_1) \equiv C_{sop}(e_2) \Leftrightarrow \forall x . \text{eval}(e_1, x) = \text{eval}(e_2, x)
51 | ```
52 |
53 | **Proof** the left-to-right direction can be proven by structural induction
54 | over the SoP normal form syntax,
55 | together with the fact that
56 | the standard rewrite rules we use preserve the semantics of arithmetics.
57 |
58 | I now prove the contrapositive of the backward direction:
59 |
60 | ```math
61 | C_{sop}(e_1) \not\equiv C_{sop}(e_2) \Rightarrow \exists x . \text{eval}(e_1, x) \neq \text{eval}(e_2, x)
62 | ```
63 |
64 | There are two cases for $C_{sop}(e_1) \not\equiv C_{sop}(e_2)$:
65 | 1. e1 and e2 differ in their constant term
66 | (e.g. e1 = 2xy + 4 and e2 = 3yz + 7), and
67 | 2. otherwise (e.g. e1 = 2xy + 4 and e2 = 3yz + 4).
68 |
69 | Note that we only look at the lonely constants,
70 | not the coefficients in other terms.
71 |
72 | Case 1 is simple,
73 | since if two expressions have different constants
74 | they evaluate to different results (i.e the constants themselves) on all-zero inputs.
75 |
76 | To prove case 2, I break down the goal into two steps:
77 |
78 | ```math
79 | C_{sop}(e_1) \not\equiv C_{sop}(e_2) \Rightarrow \exists y . \frac{\partial e_1}{\partial y} \neq \frac{\partial e_2}{\partial y}
80 | ```
81 |
82 | and
83 |
84 | ```math
85 | \exists y . \frac{\partial e_1}{\partial y} \neq \frac{\partial e_2}{\partial y} \Rightarrow \exists x . \text{eval}(e_1, x) \neq \text{eval}(e_2, x)
86 | ```
87 |
88 | Recall that $\neq$ is semantic inequivalence.
89 |
90 | The latter is simple:
91 | pick x1 and x2 that only differ in the y variable (from $\partial y$ above).
92 | Since the derivatives differ,
93 | we can always find a pair of x1 and x2 such that
94 | either $\text{eval}(e_1, x_1) \neq \text{eval}(e_2, x_1)$
95 | or $\text{eval}(e_1, x_2) \neq \text{eval}(e_2, x_2)$.
96 |
97 | To prove $C_{sop}(e_1) \not\equiv C_{sop}(e_2) \Rightarrow \exists y . \partial e_1 / \partial y \neq \partial e_2 / \partial y$,
98 | we perform induction over the derivatives of the expressions,
99 | with our original proof goal as the inductive hypothesis:
100 |
101 | ```math
102 | C_{sop}(e_1) \equiv C_{sop}(e_2) \Leftrightarrow \forall x . \text{eval}(e_1, x) = \text{eval}(e_2, x)
103 | ```
104 |
105 | Since $C_{sop}(e_1) \not\equiv C_{sop}(e_2)$,
106 | we know $\exists y . \partial e_1 / \partial y \not\equiv \partial e_2 / \partial y$ (syntactically).
107 | Since the derivative of a canonical form is also canonical
108 | (not that obvious, but you'll see it after thinking a little harder),
109 | by our inductive hypothesis,
110 | $\exists y . \partial e_1 / \partial y \neq \partial e_2 / \partial y$ (semantically).
111 |
112 | The preceding induction is sound because taking the derivative makes any expression simpler,
113 | eventually bringing it to a constant. $\blacksquare$
114 |
115 | The main takeaway here is that,
116 | when we want to perform an inductive proof,
117 | we need every inductive step to make the terms simpler.
118 | Usually this is done by structural induction over the language definition;
119 | but when the language is differentiable,
120 | the derivative is another tool for inductively simplifying the terms.
121 | This can come in handy for the PL researcher,
122 | since we now know a wide range of languages are differentiable – not just polynomials!
123 |
124 | p.s. [Haotian Jiang](https://jhtdavid96.wixsite.com/jianghaotian)
125 | & [Sorawee Porncharoenwase](https://homes.cs.washington.edu/~sorawee/en/)
126 | pointed out a much simpler proof:
127 | given two semantically equivalent arithmetic expressions,
128 | their difference is always zero.
129 | Therefore, the expression that represents the difference has infinitely many roots.
130 | According to the fundamental theorem of algebra,
131 | the two expressions must be the same polynomial,
132 | since otherwise their difference would be a non-zero polynomial with finitely many roots.
133 | Nevertheless, the main point of this post isn't to prove normalization,
134 | but to showcase the technique of gradient induction.
135 |
--------------------------------------------------------------------------------
/posts/late-materialization.md:
--------------------------------------------------------------------------------
1 | # Late materialization is (almost) worst-case optimal
2 |
3 | In our SIGMOD'23 [paper](https://arxiv.org/abs/2301.10841) we proposed
4 | a new join algorithm called *free join* that generalizes and unifies
5 | both traditional binary joins and worst-case optimal joins (WCOJ).
6 | This bridge can be very useful in practice, because it brings WCOJ closer
7 | to traditional algorithms, making it easier to adopt in existing systems.
8 | Indeed, WCOJ has not seen wide adoption mainly because it seems so different
9 | from what's already implemented in databases.
10 | The surprise in this post is that your database is probably already almost
11 | worst-case optimal, and only some small changes are needed to get the last mile.
12 | In particular, if your system implements late materialization, then you're
13 | in good shape.
14 | You just need the following to get worst-case optimality:
15 |
16 | 1. Make the late materialization a little more aggressive
17 | 2. Change the query plan a bit
18 | 3. Add a very simple adaptive processing primitive
19 |
20 | ## A simple fact about AGM
21 |
22 | This whole WCOJ line of work goes back to the AGM bound,
23 | and I've written about it [before](wcoj.md).
24 | A very simple and useful property of the AGM bound is *decomposability* (Lemma 4.1, [NRR'13](https://arxiv.org/abs/1310.3314)),
25 | demonstrated with the triangle query here:
26 |
27 | ```math
28 | \sum_{(a, b) \in R} \text{AGM}\left( \texttt{Q(z) = S(a,z), T(z,b)} \right)
29 | \leq \text{AGM}\left( \texttt{Q(x,y,z) = R(x,y), S(x,z), T(z,y)} \right)
30 | ```
31 | What this means is that instead of the standard Generic Join where we build a trie
32 | for each relation and do intersections,
33 | we can skip trie building for one of the relations and simply iterate over it.
34 | That is, the following is still worst-case optimal:
35 |
36 | ```
37 | for a, b in R
38 | for c in S[a].z ^ T[b].z
39 | output(a, b, c)
40 | ```
41 |
42 | This is the first step bringing us closer to binary join from WCOJ,
43 | because in binary (hash) join we only build a hash table on one side and iterate over the other.
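
The loop above runs as-is in Python on a toy instance (the relations below are made up; `S` and `T` are pre-hashed on `x` and `y` respectively, while `R` is only scanned):

```python
# Triangle query Q(x,y,z) = R(x,y), S(x,z), T(z,y): scan R, then
# intersect the z-lists from S and T. Only S and T need hashing.
R = [(1, 2), (1, 3)]
S = {1: {4, 5}}       # S[x] -> set of z values
T = {2: {4}, 3: {9}}  # T[y] -> set of z values

out = []
for a, b in R:
    for c in S.get(a, set()) & T.get(b, set()):
        out.append((a, b, c))

assert out == [(1, 2, 4)]
```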
44 |
45 | ## Late materialization
46 |
47 | Late materialization is one of the ideas with the most bang-for-the-buck:
48 | it's very simple yet can lead to dramatic speedup.
49 | To illustrate with an example, consider the query `Q(x, u, v) :- R(x), S(x, u), T(x, v)`.
50 | This is a simplified version of the "clover query" in our free join [paper](https://arxiv.org/abs/2301.10841).
51 | Now imagine the `x` column contains book titles, i.e. it's short,
52 | but the `u, v` columns contain the content of books, i.e. very long.
53 | Naive binary join will first join `R` and `S`, and loop over each result to join with `T`.
54 | That is:
55 |
56 | ```
57 | # hash S on x, hash T on x
58 |
59 | for x in R:
60 | us = S[x]? # ? means continue to enclosing loop if the lookup fails
61 | for u in us:
62 | vs = T[x]?
63 | for v in vs:
64 | output(x, u, v)
65 | ```
66 |
67 | The second loop is a bit silly, because we are retrieving the `u`s even though
68 | we don't need them to probe into `T`.
69 | And getting each `u` is expensive, because they contain the entire content of a book.
70 | The idea of late materialization is to delay actually retrieving each `u` until
71 | we are ready to output in the innermost loop.
72 | For example, we can simply iterate over an array of pointers to the `u`s,
73 | and dereference only at the end.
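
A minimal sketch of this in Python, where row ids stand in for the pointers and all names and data are made up:

```python
# S is hashed on x but maps to row ids instead of materialized rows;
# the heavy u column is dereferenced only right before output.
S_rows = {1: [0, 1]}                    # S[x] -> row ids
S_u = ["<book one>", "<book two>"]      # heavy u column, by row id
T = {1: ["<review>"]}                   # T[x] -> v values

out = []
for x in [1, 2]:                        # scan R
    for r in S_rows.get(x, []):         # iterate over pointers, not payloads
        for v in T.get(x, []):
            out.append((x, S_u[r], v))  # dereference u only here
```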
74 |
75 | ## More aggressively late
76 |
77 | To get more performance, we need to be more aggressive in being late.
78 | Instead of delaying dereference during iteration,
79 | we delay the entire iteration until it's needed.
80 | Using our example, we'll push the second loop to run after the lookup on `T`:
81 |
82 | ```
83 | # hash S on x, hash T on x
84 |
85 | for x in R:
86 | us = S[x]? # ? means continue to enclosing loop if the lookup fails
87 | vs = T[x]?
88 | for u in us:
89 | for v in vs:
90 | output(x, u, v)
91 | ```
92 |
93 | Probing into both `S` and `T` first will filter out some `u`s and `v`s
94 | so we won't have to iterate over those.
95 |
96 | Another limitation found in many systems is that they only delay the materialization
97 | of non-join attributes.
98 | In general, you might also want to delay materializing join attributes as well.
99 | The triangle query is one example, where all attributes are joined on.
100 | Furthermore, when building hash tables, people usually hash on *all* join attributes.
101 | To get worst-case optimality, we'll want to hash on some of the attributes first,
102 | and delay the rest until later.
103 | For more details see the COLT data structure in our [paper](https://arxiv.org/abs/2301.10841).
104 |
105 | ## Changing the plan
106 |
107 | In the above when we pushed down the second loop, we already broke away from the original
108 | binary plan that computes the join of `R` and `S` first.
109 | A simple way to guarantee worst-case optimality now is to go all the way and
110 | use a generic join plan.
111 | A generic join plan is simply a total ordering of all the attributes.
112 | For example, `[x, y, z]` for the triangle query and `[x, u, v]` for our second example query above.
113 | During execution, we'll join all the relations that have the first variable first, then
114 | go to those with the second variable, and so on.
115 | While joining on a variable, we delay the materialization of all other variables,
116 | and this essentially implements the "intersection" from generic join.
117 |
118 | However, it's debatable if we *should* go all the way to a generic join plan just because we can.
119 | If the optimizer picks a particular binary plan, it's probably because it thinks the plan is fast.
120 | In the paper we take a more conservative approach and greedily transform the binary plan
121 | towards WCOJ, while avoiding accidental slowdowns.
122 |
123 | ## Adaptive processing
124 |
125 | A subtle detail of WCOJ is that every intersection must take time linear
126 | to the size of the smallest relation;
127 | otherwise the algorithm won't be worst-case optimal.
128 | In hash join, we intersect by iterating over the keys of one relation,
129 | and probing into the others.
130 | A simple way to satisfy the requirement is to simply pick the smallest
131 | relation to iterate over at run time.
132 | This will guarantee worst-case optimality, but there's an interesting trade-off:
133 | iterating over the smallest relation means we now have to build
134 | hash tables on the larger relations, and this is expensive.
135 | On the other hand, if we iterate over the largest one,
136 | we save time hashing it.
137 | In practice, the time saved can be significant, and may outweigh the
138 | warm and fuzzy feeling of worst-case optimality.
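
In code, the run-time choice is a one-liner. A sketch of intersect-by-probing that picks the smallest map to scan, assuming the inputs are already hashed:

```python
def intersect(maps):
    """Intersect hash maps: scan the smallest, probe the others,
    so the cost is linear in the smallest input."""
    smallest = min(maps, key=len)
    others = [m for m in maps if m is not smallest]
    return [k for k in smallest if all(k in m for m in others)]

assert intersect([{1: 'a', 2: 'b'}, {2: 'c', 3: 'd', 4: 'e'}]) == [2]
```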
139 |
140 | ## Conclusion
141 |
142 | At this point, even if you're not convinced you can massage your join
143 | implementation into one that is worst-case optimal,
144 | I hope your suspicion is strong enough that you'll try it.
145 | I think the COLT data structure is really central to all of this,
146 | and once you see that you can implement COLT with late materialization,
147 | the rest is simple.
--------------------------------------------------------------------------------
/posts/recursive-agm.md:
--------------------------------------------------------------------------------
1 | # Bounding the run time of recursive queries with the AGM bound
2 |
3 | **Acknowledgements** All ideas in this post are due to @mabokhamis and @hung-q-ngo;
4 | I'm just writing them down so that I don't have to bother them when I forget.
5 |
6 | In an [earlier post](wcoj.md) I talked about how to use the AGM bound
7 | to bound the run time of the Generic Join algorithm.
8 | It turns out we can sometimes bound the run time of recursive queries as well.
9 | Consider the transitive closure program in Datalog:
10 |
11 | ```
12 | P(x, y) :- E(x, y).
13 | P(x, y) :- E(x, z), P(z, y).
14 | ```
15 |
16 | During semi-naive evaluation, we will compute a delta relation $dP$ every iteration as
17 | $dP_{i+1} = E \bowtie dP_i - P_i$.
18 | The set difference can be done in linear time, so we will focus on the join only.
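
For concreteness, here is a minimal Python sketch of semi-naive evaluation for this program (just to fix notation; not meant to be efficient):

```python
def transitive_closure(E):
    """Semi-naive evaluation: dP_{i+1} = (E join dP_i) - P_i."""
    P = set(E)
    dP = set(E)
    while dP:
        # join E(x, z) with dP(z, y) on z
        joined = {(x, y) for (x, z) in E for (z2, y) in dP if z == z2}
        dP = joined - P   # the set difference
        P |= dP
    return P

assert transitive_closure({(1, 2), (2, 3)}) == {(1, 2), (2, 3), (1, 3)}
```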
19 |
20 | If we look at *all* iterations, we'll be computing
21 | $E \bowtie dP_0 \cup E \bowtie dP_1 \cup \cdots \cup E \bowtie dP_{k-1}$.
22 | Factoring out `E`, we get $E \bowtie (dP_0 \cup \cdots \cup dP_{k-1}) = E \bowtie P_k$,
23 | where $P_k$ is the relation $P$ at iteration $k$.
24 | Since $P_k$ must be contained in the final output $O$, i.e. $|P_k| \leq |O|$,
25 | at this point we can say the whole semi-naive algorithm runs in $O(|E|\times|O|)$.
26 | But it turns out we can do better.
27 |
28 | To reduce clutter, I'll write $P$ for $P_k$.
29 | Now take a closer look at the join `Q(x, y, z) :- E(x, z), P(z, y)`.
30 | For the moment, let's also add $O$ into the join to make a triangle `Q'(x, y, z) :- E(x, z), P(z, y), O(x, y)`.
31 | With this, we can use the AGM bound to get $O(|E|^\frac{1}{2} |P|^\frac{1}{2} |O|^\frac{1}{2}) \leq O(|E|^\frac{1}{2} |O|)$,
32 | which is a tighter bound than the above.
33 | I now claim we can also use this bound for $Q$.
34 | The key is that the execution of Generic Join for $Q'$ is exactly the same as that for $Q$.
35 |
36 | Consider the variable ordering `z, x, y`. The GJ loop for $Q'$ is the following:
37 |
38 | ```
39 | for z in E.z ^ P.z
40 | for x in E[z].x ^ O.x
41 | for y in P[z].y ^ O[x].y
42 | output(x, y, z)
43 | ```
44 |
45 | Since $O$ is the final output of the transitive closure program,
46 | we have the invariant $\forall x, y, z : (x, z) \in E \wedge (z, y) \in P \implies (x, y) \in O$.
47 | Therefore, we can remove the intersections with $O$ on both inner loops,
48 | and the run time would remain the same since $O$ does not filter out any value.
49 | With $O$ removed, the nested loop now computes exactly $Q$,
50 | taking the same time as $Q'$.
--------------------------------------------------------------------------------
/posts/sql-eq.md:
--------------------------------------------------------------------------------
1 | # How to Check 2 SQL Tables are the Same
2 |
3 | Today [Stanley](https://github.com/az15240)
4 | asked me a simple question:
5 | how can we check if the contents of two SQL tables are the same?
6 | Well, you just do `SELECT * FROM t1 = t2`... wait, that's wrong, comparison
7 | doesn't work on entire tables in SQL.
8 | My second attempt is a bit better: if we take the difference of the table both ways,
9 | and end up with empty results, then they must be the same, right?
10 | In SQL: `SELECT * FROM t1 EXCEPT SELECT * FROM t2` (and the other way).
11 | Wrong again! Because `EXCEPT` takes the *set* difference,
12 | it will be empty if, say, `t1` contains 2 copies of a tuple, but `t2` contains only one.
13 |
14 | I gave up a little bit and started searching online,
15 | but surprisingly there was not a single satisfying answer!
16 | The solutions online either suffer from the same issue as the `EXCEPT` query,
17 | or use some obscure features that are not standard SQL (e.g. `CHECKSUM`, which doesn't really work anyway).
18 | How hard can it be to compare two tables in SQL?!
19 |
20 | Intrigued, I posted the problem as a challenge to my colleagues:
21 | **Write a query, using only standard SQL features, to check if two tables are the same**.
22 | Here "same" means the two tables contain the same set of distinct tuples,
23 | and every tuple has the same number of copies in each table.
24 | Formally, they are the same bag/multiset.
25 |
26 | If you've read the SQL standard (and every "non-standard") cover to cover,
27 | you'll come up with the following query after a few Campari drinks:
28 | `SELECT * FROM t1 EXCEPT ALL SELECT * FROM t2`.
29 | The key is `EXCEPT ALL` which takes the "bag difference"
30 | similar to how `UNION ALL` takes the "bag union".
31 | Alas, `EXCEPT ALL` is not implemented by SQLite!
32 | And probably for good reasons:
33 | whereas `EXCEPT` can be compiled to just an anti-join,
34 | executing `EXCEPT ALL` probably requires keeping track of
35 | which copy of the same tuple we've seen,
36 | or keeping a count per distinct tuple.
37 |
38 | A more "vanilla SQL" solution looks like this:
39 | ```SQL
40 | SELECT *, COUNT(*)
41 | FROM t1
42 | GROUP BY x, y, z, ... -- all attributes of t1
43 |
44 | EXCEPT
45 |
46 | SELECT *, COUNT(*)
47 | FROM t2
48 | GROUP BY x, y, z, ... -- all attributes of t2
49 | ```
50 | Here, we group by all attributes of the table
51 | in order to explicitly mark every distinct tuple with its count.
52 | And because all tuples are distinct after the grouping,
53 | we can use `EXCEPT` to compare the results.
54 | That's pretty good! I should be happy about it and go back to work.
55 |
56 | But I can't get over one small ugliness: I had to manually
57 | list all the attributes in the `GROUP BY` clause,
58 | since `GROUP BY *` doesn't work.
59 | This means we have to change the query for every new schema.
60 | "Fine," you say, "just generate the query and get back to work".
61 | Problem is, I don't feel like working today, so I invite myself
62 | to another challenge: **write a _single_ query that does the job
63 | for every pair of tables, where we are only allowed to change the table names**.
64 |
65 | TBH it's not surprising I got nerd sniped by this problem:
66 | [half](https://dl.acm.org/doi/10.14778/3407790.3407799)
67 | [of](https://egraphs-good.github.io)
68 | [my](https://github.com/uwplse/tensat)
69 | [PhD](https://arxiv.org/abs/2108.02290)
70 | [dealt](https://arxiv.org/abs/2202.10390)
71 | [with](https://remy.wang/reports/dfta.pdf)
72 | [equivalence](https://dl.acm.org/doi/abs/10.1145/3591239),
73 | and one idea in fact led to the final solution.
74 | **The key idea is to view a table in bag semantics
75 | as a vector of numbers,
76 | and view joins of tables as polynomials**.
77 | Specifically, consider sorting all the distinct elements,
78 | and the `i`th entry of the vector stores the
79 | count of the `i`th distinct element.
80 | For example, the table `t=[a, b, b, c, c]` becomes the vector `[1 2 2]`.
81 | Then, a self-join becomes point-wise multiplication of the vector with itself.
82 | Using the same example, `t NATURAL JOIN t` contains 1 copy of `a`, 4 copies of `b`,
83 | and 4 copies of `c`, `[1 4 4] = [1 2 2] * [1 2 2]`.
84 |
85 | With this, we can connect repeated self-joins of a table with
86 | the [moments](https://en.wikipedia.org/wiki/Moment_(mathematics))
87 | of the vector (or [power sum](https://en.wikipedia.org/wiki/Newton%27s_identities), or [p-norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm), if you're more familiar with those).
88 | That is, the query:
89 | ```SQL
90 | SELECT COUNT(*)
91 | FROM t NATURAL JOIN t
92 | NATURAL JOIN t
93 | ... -- total of p copies of t's
94 | NATURAL JOIN t
95 | ```
96 | computes $\sum_{i=1}^n v_i^p$, where $v$ is the vector representation of `t`
97 | and $n$ is its length, i.e. the number of distinct elements in `t`.
98 | Abusing notation, we'll write that as $||v||_p$.
99 |
100 | The above connection lets us use a very elegant result:
101 | for any two vectors $u, v$ of length $n$,
102 | if $\forall 1 \leq p \leq n : ||u||_p = ||v||_p$
103 | then $u$ must be a permutation of $v$.
104 | In other words, the $n$ moments uniquely determine
105 | a bag of values!
106 | See Appendix A of [Abo Khamis et al.](https://arxiv.org/abs/2306.14075)
107 | for a very elegant proof.
108 |
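A quick Python sketch of the moments idea, computed on the bag itself rather than in SQL:

```python
from collections import Counter

def moments(table):
    """Power sums of the table's multiplicity vector, for p = 1 .. n."""
    counts = list(Counter(table).values())
    n = len(counts)
    return [sum(c ** p for c in counts) for p in range(1, n + 1)]

# [a,b,b] and [a,a,b] have count vectors [1,2] and [2,1]:
# permutations of each other, hence equal moments
assert moments(['a', 'b', 'b']) == moments(['a', 'a', 'b'])
# a genuinely different bag gives different moments
assert moments(['a', 'b', 'b']) != moments(['a', 'b', 'c'])
```

The first assertion is exactly the failure mode of comparing moments alone, which we still need to rule out.
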
109 | Our game plan is now this: for each table, compute all $n$ moments
110 | and compare the results.
111 | We can do this with a recursive query:
112 | ```SQL
113 | CREATE TABLE t1_moments AS
114 |
115 | WITH RECURSIVE r1 AS (
116 | -- first iteration, return t1 as-is
117 | SELECT 1 as i, t1.*
118 | FROM t1
119 |
120 | UNION ALL
121 |
122 | -- iterations i+1 joins together i+1 copies of t1
123 | SELECT r1.i + 1 AS i, t1.*
124 | FROM r1 NATURAL JOIN t1
125 | -- we could have stopped at COUNT(DISTINCT *) ...
126 | -- but that's not valid SQL :(
127 | WHERE i < (SELECT COUNT(*) FROM t1)
128 | )
129 |
130 | -- compute the moment with COUNT
131 | SELECT COUNT(*) FROM r1 GROUP BY i;
132 | ```
133 | After computing `t2_moments` in the same way, we can compare them
134 | with `EXCEPT` because they do not contain duplicates.
135 |
136 | But that's not enough, since having the same moments only guarantees
137 | the vectors are permutations of each other.
138 | In terms of the original relation,
139 | the table `[a, b, b]` will be indistinguishable from the table `[a, a, b]`,
140 | because `[1 2]` has the same moments as `[2 1]`.
141 | To rule out this case, we use a simple fact from linear algebra:
142 | if $v$ is a permutation of $u$
143 | and $v \neq u$,
144 | then $v\cdot u < u\cdot u$.
145 | In SQL, this means we need to take the natural join of `t1` with `t2` and
146 | compare the count with the self join of `t1` (and of `t2`):
147 |
148 | ```SQL
149 | SELECT (SELECT COUNT(*) FROM t1 NATURAL JOIN t1)
150 | - (SELECT COUNT(*) FROM t1 NATURAL JOIN t2)
151 | AS d WHERE d <> 0;
152 |
153 | SELECT (SELECT COUNT(*) FROM t2 NATURAL JOIN t2)
154 | - (SELECT COUNT(*) FROM t1 NATURAL JOIN t2)
155 | AS d WHERE d <> 0;
156 | ```
157 |
158 | See the complete query at the end of this post.
159 | All together, the query uses only standard SQL features,
160 | and to use it for a new pair of tables we only need to change
161 | the table names.
162 | Of course, it is completely impractical for any table
163 | of decent size (it runs in time $O(N^N)$), but that's not the point :)
164 |
165 | But even the simpler query using `GROUP BY` was not trivial to come up with,
166 | which raises the question: why isn't it a standard feature of SQL to just
167 | compare two tables?
168 | I imagine it can be very useful for testing, e.g. you write a simple but slow
169 | version, and check that a more complex but fast version returns the same result.
170 |
171 | ---
172 | ```SQL
173 | CREATE TABLE t1 (x INTEGER);
174 | CREATE TABLE t2 (x INTEGER);
175 |
176 | INSERT INTO t1 VALUES (1), (1), (2), (3);
177 | INSERT INTO t2 VALUES (2), (1), (3), (2);
178 |
179 | -- If t1=t2 (meaning they are the same bag/multiset), then the following should return nothing.
180 |
181 | -- Sanity check: do they contain the same *set* of elements (ignoring duplicates)?
182 |
183 | SELECT * FROM t1 EXCEPT SELECT * FROM t2;
184 | SELECT * FROM t2 EXCEPT SELECT * FROM t1;
185 |
186 | -- Now compare the moments/power sums
187 |
188 | CREATE TABLE t1_moments AS
189 |
190 | WITH RECURSIVE r1 AS (
191 | -- First iteration, return t1 as-is
192 | SELECT 1 AS i, t1.*
193 | FROM t1
194 |
195 | UNION ALL
196 |
197 | -- Iterations i+1 joins together i+1 copies of t1
198 | SELECT r1.i + 1 AS i, t1.*
199 | FROM r1 NATURAL JOIN t1
200 | -- We could have stopped at |t1| (number of distinct elements in t1)
201 | WHERE i < (SELECT COUNT(*) FROM t1)
202 | )
203 |
204 | -- Compute the power sum with COUNT
205 | SELECT i, COUNT(*) AS c FROM r1 GROUP BY i;
206 |
207 |
208 | -- Repeat for the other table...
209 | CREATE TABLE t2_moments AS
210 |
211 | WITH RECURSIVE r2 AS (
212 | SELECT 1 AS i, t2.*
213 | FROM t2
214 |
215 | UNION ALL
216 |
217 | SELECT r2.i + 1 AS i, t2.*
218 | FROM r2 NATURAL JOIN t2
219 | WHERE i < (SELECT COUNT(*) FROM t2)
220 | )
221 |
222 | SELECT i, COUNT(*) AS c FROM r2 GROUP BY i;
223 |
224 | SELECT * FROM t1_moments EXCEPT SELECT * FROM t2_moments;
225 | SELECT * FROM t2_moments EXCEPT SELECT * FROM t1_moments;
226 |
227 | -- To rule out the case where they have the same moments, but are "permutations" of each other
228 | SELECT (SELECT COUNT(*) FROM t1 NATURAL JOIN t1) - (SELECT COUNT(*) FROM t1 NATURAL JOIN t2) AS d WHERE d <> 0;
229 | SELECT (SELECT COUNT(*) FROM t2 NATURAL JOIN t2) - (SELECT COUNT(*) FROM t1 NATURAL JOIN t2) AS d WHERE d <> 0;
230 | ```
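
The same pipeline can be driven from a host language. Below is a minimal sketch using Python's built-in `sqlite3`; the table and column names are the post's `t1`/`t2`/`x`, while the helper names (`moments`, `bags_equal`) are made up for illustration:

```python
import sqlite3

def moments(cur, table):
    # p_i = count of the i-way self natural join = sum of multiplicity^i
    n = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return cur.execute(f"""
        WITH RECURSIVE r AS (
            SELECT 1 AS i, x FROM {table}
            UNION ALL
            SELECT r.i + 1, t.x FROM r JOIN {table} t USING (x) WHERE r.i < {n}
        )
        SELECT i, COUNT(*) FROM r GROUP BY i ORDER BY i""").fetchall()

def bags_equal(cur, a, b):
    # same moments, plus the dot-product check to rule out "permutations"
    if moments(cur, a) != moments(cur, b):
        return False
    count = lambda q: cur.execute(q).fetchone()[0]
    cross = count(f"SELECT COUNT(*) FROM {a} s NATURAL JOIN {b} t")
    return cross == count(f"SELECT COUNT(*) FROM {a} s NATURAL JOIN {a} t") \
       and cross == count(f"SELECT COUNT(*) FROM {b} s NATURAL JOIN {b} t")

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE t1 (x INTEGER); CREATE TABLE t2 (x INTEGER);
    INSERT INTO t1 VALUES (1), (1), (2), (3);
    INSERT INTO t2 VALUES (2), (1), (3), (2);""")
print(bags_equal(cur, "t1", "t2"))  # False: same moments, but not the same bag
```
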
231 |
--------------------------------------------------------------------------------
/posts/ssh-image.md:
--------------------------------------------------------------------------------
1 | # View Images on A Remote Machine
2 |
3 | First connect to the remote machine with port forwarding: `ssh -L
4 | 8000:localhost:8000 usr@machine.cs.school.edu`, `cd` to the project directory, then
5 | serve the whole directory with `python3 -m http.server &> /dev/null &`. Use a local web
6 | browser to connect to `http://localhost:8000`.
7 |
--------------------------------------------------------------------------------
/posts/wcoj.md:
--------------------------------------------------------------------------------
1 | # Generic Join Algorithms
2 |
3 | This post introduces the basics of the worst-case optimal Generic Join algorithm.
4 |
5 | ## The AGM Bound
6 |
7 | Given a database (set of relations) and a query (natural join of relations),
8 | we want to know how large the query output can be.
9 | A really stupid bound just multiplies the size of each relation,
10 | because that is the size of their cartesian product.
11 | That is, given the query
12 | ```math
13 | Q(x,y,z) \leftarrow R(x,y), S(y,z), T(x,z).
14 | ```
15 | we have the bound $|Q| \leq |R| \times |S| \times |T|$.
16 |
17 | If $|R|=|S|=|T|=N$, then $|Q| \leq N^3$.
18 | We can do better - at least we know $|Q| \leq |R| \times |S| = N^2$.
19 | That is because Q contains fewer tuples than the query
20 | ```math
21 | Q’(x,y,z) \leftarrow R(x,y), S(y,z).
22 | ```
23 | since Q further joins with T.
24 |
25 | The best possible theoretical bound is the AGM bound,
26 | which is $N^{3/2}$ for Q.
27 | It’s computed from the fractional edge cover of the query hypergraph.
28 |
29 | ## Query Hypergraph
30 |
31 | The hypergraph of a query is simply the hypergraph
32 | where the vertices are the variables and the edges are the relations.
33 |
34 | Q’s hypergraph looks like this:
35 |
36 | 
37 |
38 | A non-binary (e.g. ternary) relation will become a hyperedge
39 | that connects more than 2 vertices.
40 |
41 | ## Fractional Edge Cover
42 |
43 | A set of edges *covers* a graph if together they touch all vertices.
44 | For Q’s hypergraph, any two edges form a cover.
45 | A *fractional edge cover* assigns a weight to each edge from 0 to 1;
46 | the weight of a vertex is the sum of the weights of the edges it touches.
47 | In a fractional cover, every vertex must have weight at least 1.
48 | Every edge cover is a fractional cover,
49 | because we just assign 1 to every edge in the cover
50 | and 0 to edges not in the cover.
51 | For Q’s hypergraph,
52 | $R \rightarrow 1/2, S \rightarrow 1/2, T \rightarrow 1/2$ is also a fractional cover.
53 | The AGM bound is defined to be $\min_{w_1,w_2,w_3} |R|^{w_1}|S|^{w_2}|T|^{w_3}$
54 | where $R \rightarrow w_1, S \rightarrow w_2, T \rightarrow w_3$ is a fractional cover.
55 | This is a tight upper bound on Q’s output size;
56 | i.e. in the worst case Q outputs on the order of this many tuples.
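
For the triangle query, this minimization can be checked by brute force. A small sketch (`N` is a made-up relation size; for this query, grid-searching weights in $\{0, 1/2, 1\}$ happens to suffice):

```python
from itertools import product

# Triangle query hypergraph: vertices {x,y,z}; edges R={x,y}, S={y,z}, T={x,z}.
edges = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"x", "z"}}
vertices = {"x", "y", "z"}
N = 10_000  # hypothetical size, with |R| = |S| = |T| = N

def is_cover(w):
    # a fractional cover gives every vertex total weight >= 1
    return all(sum(w[e] for e, vs in edges.items() if v in vs) >= 1
               for v in vertices)

# minimize the total weight (the exponent of N in the bound) over the grid
best = min(sum(ws) for ws in product([0, 0.5, 1], repeat=3)
           if is_cover(dict(zip(edges, ws))))
print(best)  # 1.5, so the AGM bound is N^(3/2)
```

The weight vector $(1, 1, 0)$ is also a cover (exponent 2, i.e. the $N^2$ bound above), but the all-$1/2$ cover wins.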
57 |
58 | ## Generic Join
59 |
60 | We would like an algorithm that runs in time linear to the worst case output size,
61 | and Generic Join is such an algorithm (with a log factor).
62 | It has one parameter: a global variable ordering.
63 | It is an ordering of the set of Q’s variables, say $[x,y,z]$.
64 | Any ordering works (achieves the worst-case complexity) in theory,
65 | but different orderings perform very differently in practice.
66 | Given an ordering,
67 | we assume the input relations are stored in tries sorted by the ordering.
68 | That is, given the ordering $[x,y,z]$,
69 | $R(x,y)$ is sorted by x and then y,
70 | and the first-level trie nodes are the x’s.
71 |
72 | Even more concretely, if $R=\{(3, 4), (2, 1), (2, 5)\}$,
73 | then its trie looks like this (each box is sorted):
74 |
75 | 
76 |
77 | We can very efficiently intersect (join) relations on their
78 | first (according to the variable ordering) variable given such tries.
79 |
80 | The Generic Join algorithm for computing Q is as follows:
81 |
82 | ```
83 | Q(x,y,z) = R(x,y),S(y,z),T(z,x)
84 | # variable ordering [x,y,z]
85 |
86 | A= R(x,y).x ∩ T(z,x).x
87 | for a in A do
88 | # compute Q(a,y,z) = R(a,y),S(y,z),T(z,a)
89 | B= R(a,y).y ∩ S(y,z).y
90 | for b in B do
91 | # compute Q(a,b,z) = R(a,b),S(b,z),T(z,a)
92 | C= S(b,z).z ∩ T(z,a).z
93 | for c in C do
94 | output (a,b,c)
95 | ```
96 |
97 | Note that selection, e.g. $R(a, y)$, is free / very fast because we have the tries.
98 | $A \cap B$ can also be done in $\tilde{O}(\min(|A|, |B|))$ time
99 | ($\tilde{O}$ means O with a log factor).
100 | For general queries we may have to intersect more than 2 relations,
101 | in which case the intersection must be performed in
102 | $\tilde{O}(\min_i|A_i|)$ time (using the merge in merge-sort).
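
The pseudocode translates almost line-for-line into Python. A sketch for the triangle query, using hash sets for the intersections (the names `trie` and `triangle` are mine; a real implementation would intersect sorted tries to get the $\tilde{O}(\min_i|A_i|)$ bound):

```python
def trie(pairs):
    # index a binary relation by its first attribute
    t = {}
    for a, b in pairs:
        t.setdefault(a, set()).add(b)
    return t

def triangle(R, S, T):
    """Generic Join for Q(x,y,z) = R(x,y), S(y,z), T(z,x), ordering [x,y,z]."""
    Rx = trie(R)                               # x -> {y}
    Sy = trie(S)                               # y -> {z}
    Tx = trie((x, z) for z, x in T)            # x -> {z} (T is stored as T(z,x))
    out = []
    for a in sorted(Rx.keys() & Tx.keys()):    # A = R(x,y).x ∩ T(z,x).x
        for b in sorted(Rx[a] & Sy.keys()):    # B = R(a,y).y ∩ S(y,z).y
            for c in sorted(Sy[b] & Tx[a]):    # C = S(b,z).z ∩ T(z,a).z
                out.append((a, b, c))
    return out

print(triangle([(1, 2)], [(2, 3)], [(3, 1)]))  # [(1, 2, 3)]
```
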
103 |
104 | ## When to Use Generic Join
105 |
106 | Use Generic Join when the query is cyclic - if the query is acyclic,
107 | the theoretical optimal algorithm is Yannakakis
108 | (equivalent to message passing in probabilistic graphical models)
109 | which runs in time linear in the input plus output size;
110 | in practice people use binary joins for acyclic queries,
111 | because Yannakakis always incurs a constant factor of 3 (it’s a 3-pass algorithm).
112 |
113 | In practice, Generic Join may have good performance compared to binary joins,
114 | since it is sometimes equivalent to a linear join tree.
115 | The “average case” complexity of Generic Join is open.
116 |
117 | ## References
118 |
119 | Hung Q Ngo, Christopher Ré, and Atri Rudra. 2014. Skew strikes back: new developments in the theory of join algorithms. SIGMOD Rec. 42, 4 (December 2013), 5–16. DOI:https://doi.org/10.1145/2590989.2590991
120 |
121 | Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2012. Worst-case optimal join algorithms: [extended abstract]. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems (PODS '12). Association for Computing Machinery, New York, NY, USA, 37–48. DOI:https://doi.org/10.1145/2213556.2213565
122 |
123 | Veldhuizen, Todd L. "Leapfrog triejoin: A simple, worst-case optimal join algorithm." arXiv preprint arXiv:1210.0481 (2012).
124 |
125 | Mihalis Yannakakis. 1981. Algorithms for acyclic database schemes. In Proceedings of the seventh international conference on Very Large Data Bases - Volume 7 (VLDB '81). VLDB Endowment, 82–94.
126 |
--------------------------------------------------------------------------------