You can access the slides from a recent presentation here:
We present an elementary branch-and-bound algorithm with a simple analysis of why it achieves worst-case optimality for join queries on classes of databases defined respectively by cardinality or acyclic degree constraints. We then show that, given a reasonable way of recursively estimating upper bounds on the number of answers of join queries, our algorithm can be turned into an algorithm for uniformly sampling answers with expected running time \(\tilde{\mathcal{O}}(\mathsf{UP}/\mathsf{OUT})\), where \(\mathsf{UP}\) is the upper bound, \(\mathsf{OUT}\) is the actual number of answers and \(\tilde{\mathcal{O}}(\cdot)\) ignores polylogarithmic factors. Our approach recovers recent results on worst-case optimal join algorithms and sampling in a modular, clean and elementary way.
We’ll start by looking into different ways of joining relations to answer queries.
Consider the following example database:
| R | x | y |
|---|---|---|
|   | 0 | 0 |
|   | 0 | 1 |
|   | 2 | 1 |

| S | x | z |
|---|---|---|
|   | 0 | 0 |
|   | 0 | 2 |
|   | 2 | 3 |

| T | y | z |
|---|---|---|
|   | 0 | 2 |
|   | 1 | 0 |
|   | 1 | 2 |
and the following query: \(Q = R(x, y) \land S(x, z) \land T(y, z)\). A classical way of answering it is to first devise a query plan, for example: join \(R\) and \(S\), then join the result with \(T\). Once the query plan has been devised from the query and the database, we can materialise the intermediate joins.
In this case, this would lead to:
| R ⋈ S | x | y | z |
|---|---|---|---|
|   | 0 | 0 | 0 |
|   | 0 | 0 | 2 |
|   | 0 | 1 | 0 |
|   | 0 | 1 | 2 |
|   | 2 | 1 | 3 |

| R ⋈ S ⋈ T | x | y | z |
|---|---|---|---|
|   | 0 | 0 | 2 |
|   | 0 | 1 | 0 |
|   | 0 | 1 | 2 |
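The plan above can be reproduced with two hash joins; here is a minimal Python sketch, with relation contents taken from the example tables (the function name and encoding are illustrative, not from the paper):

```python
# Pairwise query plan: materialise R ⋈ S with a hash join on the shared
# variable x, then join the result with T on (y, z).

def hash_join(left, right, on_left, on_right):
    """Join two lists of tuples on the given column indices."""
    index = {}
    for u in right:
        index.setdefault(tuple(u[i] for i in on_right), []).append(u)
    out = []
    for t in left:
        key = tuple(t[i] for i in on_left)
        for u in index.get(key, []):
            # keep the left tuple, append the non-key columns of the right one
            out.append(t + tuple(v for i, v in enumerate(u) if i not in on_right))
    return out

R = [(0, 0), (0, 1), (2, 1)]   # columns (x, y)
S = [(0, 0), (0, 2), (2, 3)]   # columns (x, z)
T = [(0, 2), (1, 0), (1, 2)]   # columns (y, z)

RS = hash_join(R, S, on_left=(0,), on_right=(0,))        # tuples (x, y, z)
answers = hash_join(RS, T, on_left=(1, 2), on_right=(0, 1))

print(sorted(RS))       # → the 5 tuples of the R ⋈ S table
print(sorted(answers))  # → the 3 tuples of the R ⋈ S ⋈ T table
```

Running it recovers exactly the two intermediate tables shown above.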
Suppose that we can bound the size of all relations by a given integer \(N\), that is, \(|R^{\mathbb{D}}|, |S^{\mathbb{D}}|, |T^{\mathbb{D}}| \leqslant N\). Then, from previous work by Atserias, Grohe and Marx, we know that the size of the query answer set cannot be greater than \(N^{1.5}\). With this joining method, however, the intermediate result \(R \bowtie S\) alone can contain up to \(N^2\) tuples.
For instance, consider a database \(\mathbb{D}\) on domain \(D = D_1 \uplus D_2 \uplus D_3\) such that \(0 \notin D\) and \(|D_1| = |D_2| = |D_3| = N\).
| R | x | y |
|---|---|---|
|   | 0 | $$D_2$$ |
|   | $$D_1$$ | 0 |

| S | x | z |
|---|---|---|
|   | 0 | $$D_3$$ |
|   | $$D_1$$ | 0 |

| T | y | z |
|---|---|---|
|   | 0 | $$D_3$$ |
|   | $$D_2$$ | 0 |
In this case, whichever intermediate join we pick, its size is at least quadratic: \(|R^\mathbb{D}\bowtie S^\mathbb{D}| \geqslant N^2\), \(|S^\mathbb{D}\bowtie T^\mathbb{D}| \geqslant N^2\) and \(|R^\mathbb{D}\bowtie T^\mathbb{D}| \geqslant N^2\). Yet no answer exists for this query: \(|Q(\mathbb{D})| = 0\).
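This instance is easy to check mechanically. A small Python sketch, assuming \(N = 50\) and fresh, pairwise-disjoint domains \(D_1, D_2, D_3\) that exclude \(0\):

```python
# Build the hard instance: R = {0}×D2 ∪ D1×{0}, and similarly for S and T.
N = 50
D1 = range(1, N + 1)
D2 = range(N + 1, 2 * N + 1)
D3 = range(2 * N + 1, 3 * N + 1)

R = {(0, b) for b in D2} | {(a, 0) for a in D1}   # columns (x, y)
S = {(0, c) for c in D3} | {(a, 0) for a in D1}   # columns (x, z)
T = {(0, c) for c in D3} | {(b, 0) for b in D2}   # columns (y, z)

# Every pairwise join is at least quadratic in N...
RS = {(x, y, z) for (x, y) in R for (x2, z) in S if x == x2}
ST = {(x, y, z) for (x, z) in S for (y, z2) in T if z == z2}
RT = {(x, y, z) for (x, y) in R for (y2, z) in T if y == y2}
print(len(RS) >= N * N, len(ST) >= N * N, len(RT) >= N * N)  # True True True

# ...yet the query itself has no answer.
full = {(x, y, z) for (x, y, z) in RS if (y, z) in T}
print(len(full))  # 0
```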
Join queries are expressions of the form \(Q := R_1(\mathbf{x_1}), \dots, R_m(\mathbf{x_m})\) where every \(R_i\) is a relation symbol and each \(\mathbf{x_i}\) is a tuple of variables over a given set \(X\). Even the problem of deciding whether a join query has an answer over a given database is NP-complete.
Thus, we consider the problem of finding the answers of a query \(Q\) in time proportional to the number of answers of \(Q\) on the worst possible database, that is, the database on which \(Q\) has the most answers. This line of research is therefore about finding worst-case (because of the worst possible database) optimal join algorithms.
Ngo, Porat, Ré and Rudra proposed the first WCOJ algorithm for queries defined by cardinality constraints (that is, each relation has a bounded size).
Consider the triangle query:
\[Q_\Delta = R(x_1, x_2), S(x_2, x_3), T(x_1, x_3)\]If we suppose that \(|R| \leqslant N\), \(|S| \leqslant N\) and \(|T| \leqslant N\), then, from the work of Atserias, Grohe and Marx, we know that \(|Q_\Delta(\mathbb{D})| \leqslant N^{3/2}\).
An algorithm finding the answers of \(Q_\Delta\) is therefore worst-case optimal if it runs in time \(\tilde{\mathcal{O}}(N^{3/2})\).
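For context, this exponent comes from the Atserias–Grohe–Marx inequality: any fractional edge cover \(u\) of the query's hypergraph yields an upper bound on the output size (a summary of the cited result, not a full derivation):

\[
|Q(\mathbb{D})| \;\leqslant\; \prod_{e \in E} |R_e^{\mathbb{D}}|^{u_e}
\qquad \text{whenever } u_e \geqslant 0 \text{ and } \sum_{e \ni x} u_e \geqslant 1 \text{ for every variable } x.
\]

For \(Q_\Delta\), every variable occurs in exactly two atoms, so \(u_R = u_S = u_T = \tfrac{1}{2}\) is a fractional edge cover, and

\[
|Q_\Delta(\mathbb{D})| \;\leqslant\; |R|^{1/2}\,|S|^{1/2}\,|T|^{1/2} \;\leqslant\; N^{3/2}.
\]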
\(\newcommand{\H}{\mathcal{H}}\) \(\newcommand{\C}{\mathcal{C}}\) \(\newcommand{\wc}{\mathsf{wc}}\) \(\newcommand{\ans}{\mathsf{ans}}\) \(\newcommand{\ot}{\tilde{\mathcal{O}}}\) Let \(\H(X, E)\) be a hypergraph and \(\C\) be a class of join queries with hypergraph \(\H\). We define the worst case of \(\C\) as \(\wc(\C) = \sup_{Q\in\C}|\ans(Q)|\).
An algorithm is a worst case optimal join for \(\C\) if for a query \(Q \in \C\), it outputs \(\ans(Q)\) in time \(\ot(\wc(\C) \times \mathsf{poly}(nm))\) with \(m = |E|\) and \(n = |X|\).
Our algorithm is a simple branch-and-bound which assigns a variable to any possible domain value and backtracks whenever an inconsistency is detected.
```
WCJ(Q: query, t: partial tuple)
  1. if Q[t] contains an empty relation: return
  2. i := index of the last assigned variable in t
  3. if i = n: output t and return
  4. for each d in the domain: WCJ(Q, t + [x_{i+1} <- d])
```
Consider for example the triangle query \(Q_\Delta\) over the following database:
| $$R$$ | $$x_1$$ | $$x_2$$ |
|---|---|---|
|   | 0 | 0 |
|   | 1 | 0 |
|   | 1 | 1 |
|   | 2 | 1 |

| $$S$$ | $$x_2$$ | $$x_3$$ |
|---|---|---|
|   | 0 | 2 |
|   | 0 | 3 |
|   | 1 | 0 |
|   | 1 | 2 |

| $$T$$ | $$x_1$$ | $$x_3$$ |
|---|---|---|
|   | 0 | 3 |
|   | 1 | 0 |
|   | 1 | 2 |
|   | 2 | 3 |
Then, for example, we would start by looking at what happens when we assign \(x_1 \gets 0\) and so on.
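This run can be reproduced with a short Python sketch of WCJ on the database above (the encoding of the query and the emptiness check are illustrative choices, not the paper's data structures):

```python
# Branch-and-bound join: assign variables in order, backtrack as soon as
# some relation, restricted to the current partial assignment, is empty.

variables = ["x1", "x2", "x3"]
atoms = [  # (tuple of variable names, set of tuples), from the tables above
    (("x1", "x2"), {(0, 0), (1, 0), (1, 1), (2, 1)}),  # R
    (("x2", "x3"), {(0, 2), (0, 3), (1, 0), (1, 2)}),  # S
    (("x1", "x3"), {(0, 3), (1, 0), (1, 2), (2, 3)}),  # T
]
domain = sorted({v for _, rel in atoms for tup in rel for v in tup})

def consistent(vars_, rel, t):
    """Does some tuple of rel agree with the partial assignment t?"""
    return any(all(v not in t or t[v] == val for v, val in zip(vars_, tup))
               for tup in rel)

def wcj(t, out):
    # 1. backtrack if some relation restricted by t becomes empty
    if not all(consistent(vars_, rel, t) for vars_, rel in atoms):
        return
    # 2-3. if all variables are assigned, output the tuple
    if len(t) == len(variables):
        out.append(tuple(t[v] for v in variables))
        return
    # 4. branch on every domain value for the next variable
    x = variables[len(t)]
    for d in domain:
        wcj({**t, x: d}, out)

answers = []
wcj({}, answers)
print(sorted(answers))  # → [(0, 0, 3), (1, 0, 2), (1, 1, 0), (1, 1, 2)]
```

The first branch explored is indeed \(x_1 \gets 0\), which survives the emptiness check and leads to the answer \((0, 0, 3)\).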