You can access the slides from a recent presentation here:
We present an elementary branch-and-bound algorithm with a simple analysis of why it achieves worst-case optimality for join queries on classes of databases defined respectively by cardinality or acyclic degree constraints. We then show that, given a reasonable way of recursively estimating upper bounds on the number of answers of join queries, our algorithm can be turned into an algorithm for uniformly sampling answers with expected running time \(\tilde{\mathcal{O}}(\mathsf{UP}/\mathsf{OUT})\), where \(\mathsf{UP}\) is the upper bound, \(\mathsf{OUT}\) is the actual number of answers and \(\tilde{\mathcal{O}}(\cdot)\) ignores polylogarithmic factors. Our approach recovers recent results on worst-case optimal join algorithms and sampling in a modular, clean and elementary way.
We’ll start by looking into different ways of joining relations to answer queries.
Consider the following example database:
| R | x | y |
|---|---|---|
|   | 0 | 0 |
|   | 0 | 1 |
|   | 2 | 1 |

| S | x | z |
|---|---|---|
|   | 0 | 0 |
|   | 0 | 2 |
|   | 2 | 3 |

| T | y | z |
|---|---|---|
|   | 0 | 2 |
|   | 1 | 0 |
|   | 1 | 2 |
and the following query: \(Q = R(x, y) \land S(x, z) \land T(y, z)\). A classical way of answering it is to first devise a query plan, for example: join \(R\) and \(S\), then join the result with \(T\). Once the query plan has been devised from the query and the database, we can materialise the intermediate joins.
In this case, this would lead to:
| R ⋈ S | x | y | z |
|---|---|---|---|
|   | 0 | 0 | 0 |
|   | 0 | 0 | 2 |
|   | 0 | 1 | 0 |
|   | 0 | 1 | 2 |
|   | 2 | 1 | 3 |

| R ⋈ S ⋈ T | x | y | z |
|---|---|---|---|
|   | 0 | 0 | 2 |
|   | 0 | 1 | 0 |
|   | 0 | 1 | 2 |
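The plan above can be reproduced with two hash joins; here is a minimal Python sketch, with relation contents taken from the example tables (the function name and encoding are illustrative, not from the paper):

```python
# Pairwise query plan: materialise R ⋈ S with a hash join on the shared
# variable x, then join the result with T on (y, z).

def hash_join(left, right, on_left, on_right):
    """Join two lists of tuples on the given column indices."""
    index = {}
    for u in right:
        index.setdefault(tuple(u[i] for i in on_right), []).append(u)
    out = []
    for t in left:
        key = tuple(t[i] for i in on_left)
        for u in index.get(key, []):
            # keep the left tuple, append the non-key columns of the right one
            out.append(t + tuple(v for i, v in enumerate(u) if i not in on_right))
    return out

R = [(0, 0), (0, 1), (2, 1)]   # columns (x, y)
S = [(0, 0), (0, 2), (2, 3)]   # columns (x, z)
T = [(0, 2), (1, 0), (1, 2)]   # columns (y, z)

RS = hash_join(R, S, on_left=(0,), on_right=(0,))        # tuples (x, y, z)
answers = hash_join(RS, T, on_left=(1, 2), on_right=(0, 1))

print(sorted(RS))       # → the 5 tuples of the R ⋈ S table
print(sorted(answers))  # → the 3 tuples of the R ⋈ S ⋈ T table
```

Running it recovers exactly the two intermediate tables shown above.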
Suppose that we can bound the size of all relations by a given integer \(N\), that is, \(|R^{\mathbb{D}}|, |S^{\mathbb{D}}|, |T^{\mathbb{D}}| \leqslant N\). Then, from previous work by Atserias, Grohe and Marx, we know that the size of the query answer set cannot be greater than \(N^{1.5}\). With this joining method, however, the intermediate result \(R \bowtie S\) alone can contain up to \(N^2\) tuples.
For instance, consider a database \(\mathbb{D}\) on domain \(D = D_1 \uplus D_2 \uplus D_3\) such that \(0 \notin D\) and \(|D_1| = |D_2| = |D_3| = N\).
| R | x | y |
|---|---|---|
|   | 0 | $$D_2$$ |
|   | $$D_1$$ | 0 |

| S | x | z |
|---|---|---|
|   | 0 | $$D_3$$ |
|   | $$D_1$$ | 0 |

| T | y | z |
|---|---|---|
|   | 0 | $$D_3$$ |
|   | $$D_2$$ | 0 |
In this case, whichever intermediate join we pick, its size is at least quadratic: \(|R^\mathbb{D}\bowtie S^\mathbb{D}| \geqslant N^2\), \(|S^\mathbb{D}\bowtie T^\mathbb{D}| \geqslant N^2\) and \(|R^\mathbb{D}\bowtie T^\mathbb{D}| \geqslant N^2\). Yet no answer exists for this query: \(|Q(\mathbb{D})| = 0\).
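This instance is easy to check mechanically. A small Python sketch, assuming \(N = 50\) and fresh, pairwise-disjoint domains \(D_1, D_2, D_3\) that exclude \(0\):

```python
# Build the hard instance: R = {0}×D2 ∪ D1×{0}, and similarly for S and T.
N = 50
D1 = range(1, N + 1)
D2 = range(N + 1, 2 * N + 1)
D3 = range(2 * N + 1, 3 * N + 1)

R = {(0, b) for b in D2} | {(a, 0) for a in D1}   # columns (x, y)
S = {(0, c) for c in D3} | {(a, 0) for a in D1}   # columns (x, z)
T = {(0, c) for c in D3} | {(b, 0) for b in D2}   # columns (y, z)

# Every pairwise join is at least quadratic in N...
RS = {(x, y, z) for (x, y) in R for (x2, z) in S if x == x2}
ST = {(x, y, z) for (x, z) in S for (y, z2) in T if z == z2}
RT = {(x, y, z) for (x, y) in R for (y2, z) in T if y == y2}
print(len(RS) >= N * N, len(ST) >= N * N, len(RT) >= N * N)  # True True True

# ...yet the query itself has no answer.
full = {(x, y, z) for (x, y, z) in RS if (y, z) in T}
print(len(full))  # 0
```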
Join queries are expressions of the form \(Q := R_1(\mathbf{x_1}), \dots, R_m(\mathbf{x_m})\) where every \(R_i\) is a relation symbol and each \(\mathbf{x_i}\) is a tuple of variables over a given set \(X\). Even the problem of deciding whether a join query has an answer over a given database is NP-complete.
Thus, we consider the problem of finding the answers of a query \(Q\) in time proportional to the number of answers of \(Q\) on the worst possible database, that is, the database on which \(Q\) has the most answers. This line of research is therefore about finding worst-case (because of the worst possible database) optimal join algorithms.
Ngo, Porat, Ré and Rudra proposed the first WCOJ algorithm for queries defined by cardinality constraints (that is, each relation has a bounded size).
Consider the triangle query:
\[Q_\Delta = R(x_1, x_2), S(x_2, x_3), T(x_1, x_3)\]If we suppose that \(|R| \leqslant N\), \(|S| \leqslant N\) and \(|T| \leqslant N\), then, from the work of Atserias, Grohe and Marx, we know that \(|Q_\Delta(\mathbb{D})| \leqslant N^{3/2}\).
An algorithm finding the answers of \(Q_\Delta\) is therefore worst-case optimal if it runs in time \(\tilde{\mathcal{O}}(N^{3/2})\).
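For context, this exponent comes from the Atserias–Grohe–Marx inequality: any fractional edge cover \(u\) of the query's hypergraph yields an upper bound on the output size (a summary of the cited result, not a full derivation):

\[
|Q(\mathbb{D})| \;\leqslant\; \prod_{e \in E} |R_e^{\mathbb{D}}|^{u_e}
\qquad \text{whenever } u_e \geqslant 0 \text{ and } \sum_{e \ni x} u_e \geqslant 1 \text{ for every variable } x.
\]

For \(Q_\Delta\), every variable occurs in exactly two atoms, so \(u_R = u_S = u_T = \tfrac{1}{2}\) is a fractional edge cover, and

\[
|Q_\Delta(\mathbb{D})| \;\leqslant\; |R|^{1/2}\,|S|^{1/2}\,|T|^{1/2} \;\leqslant\; N^{3/2}.
\]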
\(\newcommand{\H}{\mathcal{H}}\) \(\newcommand{\C}{\mathcal{C}}\) \(\newcommand{\wc}{\mathsf{wc}}\) \(\newcommand{\ans}{\mathsf{ans}}\) \(\newcommand{\ot}{\tilde{\mathcal{O}}}\) Let \(\H(X, E)\) be a hypergraph and \(\C\) be a class of join queries with hypergraph \(\H\). We define the worst case of \(\C\) as \(\wc(\C) = \sup_{Q\in\C}|\ans(Q)|\).
An algorithm is a worst case optimal join for \(\C\) if for a query \(Q \in \C\), it outputs \(\ans(Q)\) in time \(\ot(\wc(\C) \times \mathsf{poly}(nm))\) with \(m = |E|\) and \(n = |X|\).
Our algorithm is a simple branch-and-bound which assigns a variable to any possible domain value and backtracks whenever an inconsistency is detected.
```
WCJ(Q: query, t: partial tuple)
  1. if Q[t] contains an empty relation: return
  2. i := index of the last assigned variable in t
  3. if i = n: output t and return
  4. for each d in the domain: WCJ(Q, t + [x_{i+1} <- d])
```
Consider for example the triangle query \(Q_\Delta\) over the following database:
| $$R$$ | $$x_1$$ | $$x_2$$ |
|---|---|---|
|   | 0 | 0 |
|   | 1 | 0 |
|   | 1 | 1 |
|   | 2 | 1 |

| $$S$$ | $$x_2$$ | $$x_3$$ |
|---|---|---|
|   | 0 | 2 |
|   | 0 | 3 |
|   | 1 | 0 |
|   | 1 | 2 |

| $$T$$ | $$x_1$$ | $$x_3$$ |
|---|---|---|
|   | 0 | 3 |
|   | 1 | 0 |
|   | 1 | 2 |
|   | 2 | 3 |
Then, for example, we would start by looking at what happens when we assign \(x_1 \gets 0\) and so on.
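This run can be reproduced with a short Python sketch of WCJ on the database above (the encoding of the query and the emptiness check are illustrative choices, not the paper's data structures):

```python
# Branch-and-bound join: assign variables in order, backtrack as soon as
# some relation, restricted to the current partial assignment, is empty.

variables = ["x1", "x2", "x3"]
atoms = [  # (tuple of variable names, set of tuples), from the tables above
    (("x1", "x2"), {(0, 0), (1, 0), (1, 1), (2, 1)}),  # R
    (("x2", "x3"), {(0, 2), (0, 3), (1, 0), (1, 2)}),  # S
    (("x1", "x3"), {(0, 3), (1, 0), (1, 2), (2, 3)}),  # T
]
domain = sorted({v for _, rel in atoms for tup in rel for v in tup})

def consistent(vars_, rel, t):
    """Does some tuple of rel agree with the partial assignment t?"""
    return any(all(v not in t or t[v] == val for v, val in zip(vars_, tup))
               for tup in rel)

def wcj(t, out):
    # 1. backtrack if some relation restricted by t becomes empty
    if not all(consistent(vars_, rel, t) for vars_, rel in atoms):
        return
    # 2-3. if all variables are assigned, output the tuple
    if len(t) == len(variables):
        out.append(tuple(t[v] for v in variables))
        return
    # 4. branch on every domain value for the next variable
    x = variables[len(t)]
    for d in domain:
        wcj({**t, x: d}, out)

answers = []
wcj({}, answers)
print(sorted(answers))  # → [(0, 0, 3), (1, 0, 2), (1, 1, 0), (1, 1, 2)]
```

The first branch explored is indeed \(x_1 \gets 0\), which survives the emptiness check and leads to the answer \((0, 0, 3)\).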