We present an elementary branch-and-bound algorithm with a simple analysis of why it achieves worst-case optimality for join queries on classes of databases defined respectively by cardinality or acyclic degree constraints. We then show that, given a reasonable way of recursively estimating upper bounds on the number of answers of join queries, our algorithm can be turned into an algorithm for uniformly sampling answers with expected running time \(\tilde{\mathcal{O}}(\mathsf{UP}/\mathsf{OUT})\), where \(\mathsf{UP}\) is the upper bound, \(\mathsf{OUT}\) is the actual number of answers and \(\tilde{\mathcal{O}}(\cdot)\) ignores polylogarithmic factors. Our approach recovers recent results on worst-case optimal join algorithms and sampling in a modular, clean and elementary way.
Join queries are expressions of the form \(Q := R_1(\mathbf{x_1}), \dots, R_m(\mathbf{x_m})\)
where every \(R_i\) is a relation symbol and each \(\mathbf{x_i}\) is a tuple of variables
over a given set \(X\).
The problem of simply deciding whether a join query has
an answer for a given database is NP-complete.
Thus, we consider the problem of finding the answers of a query \(Q\) in time proportional to the number of answers of \(Q\) on the worst possible database, that is, the database on which \(Q\) has the most answers.
This line of research is therefore about finding worst-case optimal join (WCOJ) algorithms, where "worst-case" refers to the worst possible database.
Ngo, Porat, Ré and Rudra proposed the first WCOJ algorithm for
queries defined by cardinality constraints (that is, each relation has
a bounded size).
Consider the triangle query:
\[Q_\Delta = R(x_1, x_2), S(x_2, x_3), T(x_1, x_3)\]If we suppose that \(|R| \leqslant N\), \(|S| \leqslant N\) and \(|T| \leqslant N\), then,
from the work of Atserias, Grohe and Marx, we know that \(Q_\Delta\) has at most \(N^{3/2}\) answers, and that this bound is tight.
An algorithm finding the answers of \(Q_\Delta\) is therefore worst-case optimal if it runs in time \(\tilde{\mathcal{O}}(N^{3/2})\).
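To see where the exponent \(3/2\) comes from, recall the standard derivation of the Atserias–Grohe–Marx bound for the triangle: assigning weight \(1/2\) to each of the three atoms is a fractional edge cover (every variable occurs in exactly two atoms, so it is covered with total weight \(1/2 + 1/2 = 1\)), and the bound multiplies the relation sizes raised to these weights:

```latex
% Fractional edge cover (u_R, u_S, u_T) = (1/2, 1/2, 1/2) of the
% triangle hypergraph: each variable x_i appears in exactly two
% atoms, so it is covered with total weight 1/2 + 1/2 = 1.
|\mathrm{answers}(Q_\Delta)|
  \;\leqslant\; |R|^{1/2} \cdot |S|^{1/2} \cdot |T|^{1/2}
  \;\leqslant\; N^{1/2} \cdot N^{1/2} \cdot N^{1/2}
  \;=\; N^{3/2}
```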
\(\newcommand{\H}{\mathcal{H}}\) \(\newcommand{\C}{\mathcal{C}}\) \(\newcommand{\wc}{\mathsf{wc}}\) \(\newcommand{\ans}{\mathsf{ans}}\) \(\newcommand{\ot}{\tilde{\mathcal{O}}}\) Let \(\H(X, E)\) be a hypergraph and \(\C\) be a class of join queries with hypergraph \(\H\). We define the worst case of \(\C\) as \(\wc(\C) = \mathsf{sup}_{Q\in\C}|\ans(Q)|\).
An algorithm is a worst-case optimal join algorithm for \(\C\) if, for every query \(Q \in \C\), it outputs \(\ans(Q)\) in time \(\ot(\wc(\C) \times \mathsf{poly}(nm))\) where \(m = |E|\) and \(n = |X|\).
Our algorithm is a simple branch-and-bound: it assigns each variable in turn to a possible domain value and backtracks whenever an inconsistency is detected.
Consider for example the triangle query \(Q_\Delta\) over the following database:
$$R$$:

| $$x_1$$ | $$x_2$$ |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 1 | 1 |
| 2 | 1 |

$$S$$:

| $$x_2$$ | $$x_3$$ |
|---|---|
| 0 | 2 |
| 0 | 3 |
| 1 | 0 |
| 1 | 2 |

$$T$$:

| $$x_1$$ | $$x_3$$ |
|---|---|
| 0 | 3 |
| 1 | 0 |
| 1 | 2 |
| 2 | 3 |
Then, for example, we would start by looking at what happens when we assign \(x_1 \gets 0\) and so on.
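The branch-and-bound enumeration on this database can be sketched as follows. This is a minimal illustrative implementation (the function name `enumerate_triangles` and the fixed variable order \(x_1, x_2, x_3\) are our choices, not prescribed by the text): each variable is assigned in turn, and a branch is abandoned as soon as the partial assignment violates a relation.

```python
# Example database from the tables above, as sets of tuples.
R = {(0, 0), (1, 0), (1, 1), (2, 1)}  # pairs (x1, x2)
S = {(0, 2), (0, 3), (1, 0), (1, 2)}  # pairs (x2, x3)
T = {(0, 3), (1, 0), (1, 2), (2, 3)}  # pairs (x1, x3)

def enumerate_triangles(R, S, T):
    """Assign x1, x2, x3 in order, backtracking on any inconsistency."""
    answers = []
    for x1 in {a for (a, _) in R}:                 # branch on x1
        for x2 in {b for (a, b) in R if a == x1}:  # prune: R(x1, x2) must hold
            for x3 in {c for (b, c) in S if b == x2}:  # prune: S(x2, x3) must hold
                if (x1, x3) in T:                  # final check: T(x1, x3)
                    answers.append((x1, x2, x3))
    return sorted(answers)

print(enumerate_triangles(R, S, T))
# [(0, 0, 3), (1, 0, 2), (1, 1, 0), (1, 1, 2)]
```

For instance, the branch \(x_1 \gets 0\) forces \(x_2 \gets 0\) (the only matching tuple in \(R\)), then \(x_3 \in \{2, 3\}\) from \(S\), of which only \(x_3 = 3\) survives the check against \(T\).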