A Simple Algorithm for WCOJ & Sampling

Florent Capelli[1], Oliver Irwin[2], Sylvain Salvati[2]

25/03/2025 - ICDT 2025

[1] - CRIL / Université d’Artois

[2] - CRIStAL / Université de Lille

A Simple Join Algorithm

A simple algorithm for joins

\(Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\)

R: x y
   0 0
   0 1
   2 1

S: x z
   0 0
   0 2
   2 3

T: y z
   0 2
   1 0
   1 2
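The slides leave the algorithm itself to the talk; the following is a minimal sketch of the branch-and-bound idea on this instance, with a fixed variable order \(x, y, z\) (the representation and helper names are my assumptions, not from the paper):

```python
# Branch-and-bound join on the running example: extend a partial
# assignment one variable at a time, pruning prefixes that no
# relation can be consistent with.
R = {(0, 0), (0, 1), (2, 1)}   # R(x, y)
S = {(0, 0), (0, 2), (2, 3)}   # S(x, z)
T = {(0, 2), (1, 0), (1, 2)}   # T(y, z)

ATOMS = [(R, "xy"), (S, "xz"), (T, "yz")]
ORDER = "xyz"
DOM = range(4)

def consistent(tau):
    # A prefix is consistent if every relation has a tuple agreeing
    # with tau on the already-assigned attributes.
    return all(any(all(a not in tau or tau[a] == v
                       for a, v in zip(attrs, t))
                   for t in rel)
               for rel, attrs in ATOMS)

def join(tau=None):
    tau = tau or {}
    if not consistent(tau):          # prune the whole subtree
        return []
    if len(tau) == len(ORDER):       # consistent full assignment: answer
        return [tuple(tau[v] for v in ORDER)]
    x = ORDER[len(tau)]              # next variable in the order
    return [ans for d in DOM for ans in join({**tau, x: d})]
```

On this instance the three answers are \((0,0,2)\), \((0,1,0)\) and \((0,1,2)\).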

A simple algorithm for joins

What about the complexity of this algorithm?

Complexity analysis

We now bound the number of recursive calls the algorithm makes

Number of calls

We get one call for each node of the search tree. Two cases arise:

a prefix \(\tau = x_1\gets d_1, \dots, x_i \gets d_i\) is consistent, that is \(\tau \in \mathsf{ans}(Q_i)\).

\(\sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls

a prefix \(\tau\) is not consistent. But it extends a prefix \(\tau'\) that is consistent, and each such \(\tau'\) has at most \(|\mathsf{dom}|\) extensions.

\(|\mathsf{dom}| \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls

in total, at most \((|\mathsf{dom}| + 1) \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls

\[Q_i = \bigwedge_{R\in Q}\prod_{x_1\dots x_i} R\]

Algorithm Complexity

The complexity of the branch and bound algorithm is \[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|) \]

Worst-Case Optimality

The Holy Grail of Join Algorithms

For \(Q = \bigwedge_i R_i\), we consider the class of databases whose relations all have at most \(N\) tuples: \[ \mathcal{C}[\leqslant N]= \{\mathbb{D}\mid \forall i, |R_i^\mathbb{D}| \leqslant N\} \]

We can then define the worst case as: \[ \mathsf{wc}(Q, N) = \mathsf{sup}_{\mathbb{D}\in\mathcal{C}[\leqslant N]}(|\mathsf{ans}(Q, \mathbb{D})|) \]

Quick example

\(Q_\Delta = R(x, y) \wedge S(x, z) \wedge T(y, z)\)

\(\mathsf{wc}(Q_\Delta, N) = \tilde{\mathcal{O}}(N^{3/2})\)
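The \(N^{3/2}\) figure comes from the AGM bound: putting weight \(\tfrac{1}{2}\) on each atom gives a fractional edge cover, since each variable occurs in exactly two atoms with total weight 1:

```latex
% Fractional edge cover for Q_Delta: lambda_R = lambda_S = lambda_T = 1/2,
% since each of x, y, z is covered by two atoms with total weight 1.
\[
  |\mathsf{ans}(Q_\Delta)|
    \;\leqslant\; |R|^{1/2}\,|S|^{1/2}\,|T|^{1/2}
    \;\leqslant\; N^{3/2}
\]
```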

WCO for cardinality constraints

\[\begin{align} |\mathsf{ans}(Q_i)| &= \Big|\mathsf{ans}\Big(\bigwedge_{R\in Q}\prod_{x_1\dots x_i} R\Big)\Big|\\ &= \Big|\mathsf{ans}\Big(\underbrace{\bigwedge_{R\in Q}\big(\prod_{x_1\dots x_i} R \times \{0\}^{X_R\setminus \{x_1\dots x_i\}}\big)}_{\in\ \mathcal{C}[\leqslant N]}\Big)\Big| \leqslant \mathsf{wc}(Q, N) \end{align}\]

The complexity of the branch and bound algorithm is

\[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|) \]

Since \(|\mathsf{ans}(Q_i)| \leqslant \mathsf{wc}(Q, N)\) for every \(i \leqslant n\), this is \[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot n \cdot \mathsf{wc}(Q, N)) \]

Now we remove the \(|\mathsf{dom}|\) factor

Reducing the domain size

\(\mathsf{R}\): \(x\) \(y\)
   1 2
   2 1
   3 0

\(\tilde{\mathsf{R}}^b\): \(x^2\) \(x^1\) \(x^0\) \(y^2\) \(y^1\) \(y^0\)
   0 0 1 0 1 0
   0 1 0 0 0 1
   0 1 1 0 0 0

\(b = 3\) bits
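One possible reading of this transformation in code (the helper name and tuple-set representation are my assumptions):

```python
def encode(rel, b):
    # Replace every value over {0, ..., 2^b - 1} by its b bits
    # (most significant first), so the encoded relation lives
    # over the domain {0, 1}.
    return {tuple(int(bit)
                  for v in t
                  for bit in format(v, f"0{b}b"))
            for t in rel}

R = {(1, 2), (2, 1), (3, 0)}
R_tilde = encode(R, 3)   # reproduces the table above
```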

Reducing the domain size

\(Q \rightsquigarrow \tilde{Q}^b\): the encoded query has domain \(\{0, 1\}\) of size 2

\(b = \lceil\mathsf{log}|\mathsf{dom}|\rceil\implies n\cdot b\) variables

\(\mathsf{wc}(Q, N) = \mathsf{wc}(\tilde{Q}^b, N)\)

The complexity of the branch and bound algorithm is

\[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot n \cdot \mathsf{wc}(Q, N)) \]

After binary encoding (\(n \cdot b\) variables over a domain of size 2, with \(b = \mathsf{log}|\mathsf{dom}|\) absorbed in \(\tilde{\mathcal{O}}\)), this becomes \[ \tilde{\mathcal{O}}(mn \cdot \mathsf{wc}(Q, N)) \]

The algorithm is worst-case optimal

Sampling answers uniformly

Overview

Sampling an answer is sampling one of the leaves

Complexity of Sampling

This is easy if we know how many leaves we have down each path.

Of course, we don’t

But it can also work (with an error) if we have a super-additive upper-bound on the number of leaves. That is, if \(t\) has children \(t_1,\dots, t_n\), then \(\mathsf{upb}(t) \geqslant \sum_{i=1}^n \mathsf{upb}(t_i)\).

A Recursive Sampling algorithm

On a \(\top\)-leaf: return it with probability 1

On a failure leaf: fail with probability 1

On an internal node \(t\) with children \(t_1, \dots, t_k\): sample from \(t_i\) with probability \(\frac{\mathsf{upb}(t_i)}{\mathsf{upb}(t)}\), fail otherwise
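These three cases can be sketched as follows. The tree representation is an assumption for illustration; here the estimator counts \(\top\)-leaves exactly, a super-additive bound with equality, so this instance never fails, while a genuine over-estimate fails with the probability stated on the next slide:

```python
import random

# A leaf is the string "top" (an answer) or "bot" (a dead end);
# an internal node is a list of children.

def upb(t):
    # Exact count of top-leaves below t (super-additive with equality).
    if t == "top":
        return 1
    if t == "bot":
        return 0
    return sum(upb(c) for c in t)

def sample(t, path=()):
    if t == "top":
        return path                  # return with probability 1
    if t == "bot":
        return None                  # fail with probability 1
    total = upb(t)
    r = random.random() * total      # pick a child proportionally to upb
    acc = 0
    for i, child in enumerate(t):
        acc += upb(child)
        if r < acc:
            return sample(child, path + (i,))
    return None                      # residual mass of the estimate: fail
```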

A Recursive Sampling algorithm

Let \(T\) be a tree rooted in \(r\), \(\mathsf{upb}\) a super-additive leaf estimator, and \(\mathsf{out}\) the output of our algorithm. Then for any \(\top\)-leaf \(l\), the algorithm is a uniform Las Vegas sampler with guarantees: \[ \mathsf{Pr}(\mathsf{out} = l) = \frac{1}{\mathsf{upb}(T)} \qquad \mathsf{Pr}(\mathsf{out} = \mathsf{fail}) = 1 - \frac{|\top\mathsf{-leaves}(T)|}{\mathsf{upb}(T)} \]

Case \(\mathcal{C}[\leqslant N]\): for each node \(\tau\), the estimator \(\mathsf{upb}(\tau) \coloneq \prod_R|R[\tau]|^{\lambda_R}\), where the weights \((\lambda_R)\) form a fractional edge cover as in the AGM bound, is super-additive
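On the running triangle instance, with the AGM weights \(\lambda_R = \lambda_S = \lambda_T = \tfrac{1}{2}\), this estimator can be sketched as follows (the representation and names are my assumptions); super-additivity can be checked numerically when branching on \(x\):

```python
# The example instance from the slides.
R = [(0, 0), (0, 1), (2, 1)]   # R(x, y)
S = [(0, 0), (0, 2), (2, 3)]   # S(x, z)
T = [(0, 2), (1, 0), (1, 2)]   # T(y, z)

def upb(tau):
    # |R[tau]|: tuples of R compatible with the partial assignment tau.
    def count(rel, attrs):
        return sum(all(a not in tau or tau[a] == v
                       for a, v in zip(attrs, t))
                   for t in rel)
    # AGM weights 1/2 on each atom of the triangle query.
    return (count(R, "xy") * count(S, "xz") * count(T, "yz")) ** 0.5
```

At the root, \(\mathsf{upb}(\varnothing) = 27^{1/2} \approx 5.2\), an over-estimate of the 3 actual answers, and the children obtained by fixing \(x\) satisfy the super-additivity inequality.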

Sampling complexity

Given a class of queries \(\mathcal{C}[\leqslant N]\), for any query \(Q \in \mathcal{C}[\leqslant N]\), it is possible to uniformly sample from the answer set with expected time \[ \tilde{\mathcal{O}}(\frac{\mathsf{wc}(Q, N)}{\mathsf{max}(1, |\mathsf{ans}(Q)|)} \cdot nm \cdot \mathsf{log}|\mathsf{dom}|) \]

Matches existing complexity results for uniform sampling

Conclusion

We have a simple algorithm for the worst case optimal join

We can sample uniformly from the answer set of the queries

We have presented the work for classes of queries defined by cardinality constraints \(\mathcal{C}[\leqslant N]\), but these algorithms also work for classes of queries defined by acyclic degree constraints:

  • if we have an order on the variables that is compatible with the acyclic degree constraints;
  • and by using the polymatroid bound as the upper-bound estimator for sampling.