A Simple Algorithm for Worst-Case Optimal Join and Sampling

Abstract


We present an elementary branch-and-bound algorithm with a simple analysis of why it achieves worst-case optimality for join queries on classes of databases defined respectively by cardinality or acyclic degree constraints. We then show that if one is given a reasonable way of recursively estimating upper bounds on the number of answers of the join queries, our algorithm can be turned into an algorithm for uniformly sampling answers with expected running time \(\tilde{\mathcal{O}}(\mathsf{UP}/\mathsf{OUT})\), where \(\mathsf{UP}\) is the upper bound, \(\mathsf{OUT}\) is the actual number of answers and \(\tilde{\mathcal{O}}(\cdot)\) ignores polylogarithmic factors. Our approach recovers recent results on worst-case optimal join algorithms and sampling in a modular, clean and elementary way.

Worst-Case Optimal Joins

Join queries are expressions of the form \(Q := R_1(\mathbf{x_1}), \dots, R_m(\mathbf{x_m})\) where every \(R_i\) is a relation symbol and each \(\mathbf{x_i}\) is a tuple of variables over a given set \(X\). Since the problem of simply deciding whether a join query has an answer on a given database is NP-complete, it is unlikely that one can list all of the answers in time linear in the number of answers.

Thus, we consider the problem of finding the answers of a query \(Q\) in time proportional to the number of answers on the worst possible database for \(Q\), that is, the database on which \(Q\) has the most answers.

This line of research is therefore about finding worst-case (because of the worst possible database) optimal join (WCOJ) algorithms.

Ngo, Porat, Ré and Rudra proposed the first WCOJ algorithm for queries defined by cardinality constraints (that is, each relation has bounded size). A simpler branch-and-bound algorithm, Leapfrog TrieJoin, was later proposed by Veldhuizen, and a more generic version, GenericJoin, was introduced by Ngo.

Quick example

Consider the triangle query:

\[Q_\Delta = R(x_1, x_2), S(x_2, x_3), T(x_1, x_3)\]

If we suppose that \(|R| \leqslant N\), \(|S| \leqslant N\) and \(|T| \leqslant N\), then, from the work of Atserias, Grohe and Marx, we know that the optimal bound for the number of answers of \(Q_\Delta\) is \(N^{3/2}\).

An algorithm finding the answers of \(Q_\Delta\) would therefore be worst-case optimal if it runs in time \(\tilde{\mathcal{O}}(N^{3/2})\).
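As a quick sanity check of where the exponent comes from (a sketch of the AGM argument, not the paper's proof): the weights \((1/2, 1/2, 1/2)\) form a fractional edge cover of \(Q_\Delta\), since every variable appears in exactly two atoms and is thus covered with total weight \(1\), and the AGM inequality then bounds the number of answers by the product of the relation sizes raised to these weights:

\[|\mathsf{ans}(Q_\Delta)| \;\leqslant\; |R|^{1/2}\,|S|^{1/2}\,|T|^{1/2} \;\leqslant\; N^{3/2}.\]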

General complexity

\(\newcommand{\H}{\mathcal{H}}\) \(\newcommand{\C}{\mathcal{C}}\) \(\newcommand{\wc}{\mathsf{wc}}\) \(\newcommand{\ans}{\mathsf{ans}}\) \(\newcommand{\ot}{\tilde{\mathcal{O}}}\) Let \(\H(X, E)\) be a hypergraph and \(\C\) be a class of join queries with hypergraph \(\H\). We define the worst case of \(\C\) as \(\wc(\C) = \sup_{Q\in\C}|\ans(Q)|\).

An algorithm is a worst-case optimal join algorithm for \(\C\) if, for a query \(Q \in \C\), it outputs \(\ans(Q)\) in time \(\ot(\wc(\C) \times \mathsf{poly}(nm))\) with \(m = |E|\) and \(n = |X|\).
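For instance, for the class \(\C_\Delta\) of triangle queries with \(|R|, |S|, |T| \leqslant N\) (the notation \(\C_\Delta\) is ours for this example), the AGM bound above gives \(\wc(\C_\Delta) \leqslant N^{3/2}\), and taking \(R = S = T = [\sqrt{N}] \times [\sqrt{N}]\) (assuming \(\sqrt{N}\) is an integer) attains it:

\[\wc(\C_\Delta) = N^{3/2},\]

so a WCOJ algorithm for this class must list all answers in time \(\ot(N^{3/2} \times \mathsf{poly}(nm))\).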

Our algorithm

Our algorithm is a simple branch-and-bound procedure which assigns the next variable to each possible domain value in turn and backtracks whenever an inconsistency is detected. In the pseudocode below, \(Q[t]\) denotes the residual query obtained by keeping in each relation only the tuples compatible with the partial assignment \(t\), and \(n\) is the number of variables.

WCJ(Q: query, t: partial tuple)
  1. if Q[t] contains an empty relation -> return
  2. i = index of the last assigned variable in t
  3. if i = n -> output t, return
  4. for d in domain -> WCJ(Q, t + [x_{i+1} <- d])
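
To make the pseudocode concrete, here is a minimal Python sketch of this branch-and-bound. The names wcj and restrict and the representation of a query as a list of (variables, rows) pairs are choices made for this illustration, not the paper's; for efficiency, the sketch only branches on values that still occur for the next variable in some restricted relation rather than on the whole domain, and it recomputes the restricted relations from scratch at every call, so it is meant for clarity rather than speed.

# A join query is represented as a list of atoms; an atom is a pair
# (vars, rows) where vars is a tuple of variable names and rows is a
# set of tuples of domain values.

def restrict(atom, assignment):
    # Keep only the rows compatible with the partial assignment.
    vars_, rows = atom
    return (vars_, {row for row in rows
                    if all(assignment.get(v, row[k]) == row[k]
                           for k, v in enumerate(vars_))})

def wcj(query, variables, assignment=None):
    # Branch-and-bound enumeration of the answers of the join query.
    if assignment is None:
        assignment = {}
    # Step 1: backtrack as soon as some relation becomes empty under t.
    residual = [restrict(atom, assignment) for atom in query]
    if any(not rows for _, rows in residual):
        return
    # Steps 2-3: if every variable is assigned (i = n), output the answer.
    if len(assignment) == len(variables):
        yield dict(assignment)
        return
    # Step 4: branch on the candidate values of the next variable
    # (only values still present for it in some restricted relation).
    x = variables[len(assignment)]
    candidates = {row[k] for vars_, rows in residual
                  for row in rows
                  for k, v in enumerate(vars_) if v == x}
    for d in sorted(candidates):
        yield from wcj(query, variables, {**assignment, x: d})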

Consider for example the triangle query \(Q_\Delta\) over the following database:

\[
R = \begin{array}{cc}
x_1 & x_2 \\
\hline
0 & 0 \\
1 & 0 \\
1 & 1 \\
2 & 1
\end{array}
\qquad
S = \begin{array}{cc}
x_2 & x_3 \\
\hline
0 & 2 \\
0 & 3 \\
1 & 0 \\
1 & 2
\end{array}
\qquad
T = \begin{array}{cc}
x_1 & x_3 \\
\hline
0 & 3 \\
1 & 0 \\
1 & 2 \\
2 & 3
\end{array}
\]

Then, for example, the algorithm starts by assigning \(x_1 \gets 0\): the relations \(R\) and \(T\) restricted to \(x_1 = 0\) are both non-empty, so it recurses on \(x_2\), outputs the answer \((0, 0, 3)\) once \(x_2 \gets 0\) and \(x_3 \gets 3\) have been assigned, and then backtracks to explore the remaining values of \(x_1\).
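
Running the sketch from above on this database (same hypothetical representation) reproduces this exploration and yields the four triangles of the database:

# Triangle query Q_Delta over the example database.
R = {(0, 0), (1, 0), (1, 1), (2, 1)}
S = {(0, 2), (0, 3), (1, 0), (1, 2)}
T = {(0, 3), (1, 0), (1, 2), (2, 3)}

query = [(("x1", "x2"), R), (("x2", "x3"), S), (("x1", "x3"), T)]

for answer in wcj(query, ("x1", "x2", "x3")):
    print(answer)
# {'x1': 0, 'x2': 0, 'x3': 3}
# {'x1': 1, 'x2': 0, 'x3': 2}
# {'x1': 1, 'x2': 1, 'x3': 0}
# {'x1': 1, 'x2': 1, 'x3': 2}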