A Simple Algorithm for WCOJ & Sampling

Florent Capelli[1], Oliver Irwin[2], Sylvain Salvati[2]

23/05/2025 - SISE thematic group seminar

[1] - CRIL / Université d’Artois

[2] - CRIStAL / Université de Lille

Joining relations

The good, the bad (and not the ugly)

Joining like the good

\(Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\)

R x y
0 0
0 1
2 1
S x z
0 0
0 2
2 3
T y z
0 2
1 0
1 2
  1. Devise a query plan:

\((R \bowtie S) \bowtie T\)

  2. Materialise intermediate joins
R \(\bowtie\) S x y z
0 0 0
0 0 2
0 1 0
0 1 2
2 1 3
R \(\bowtie\) S \(\bowtie\) T x y z
0 0 2
0 1 0
0 1 2
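As a concrete illustration (not code from the talk), here is a minimal Python sketch of this textbook strategy on the running example: materialise \(R \bowtie S\) first, then join with \(T\).

```python
# Minimal sketch of the classical query-plan approach: materialise (R ⋈ S), then join with T.
# Relations are sets of tuples over the running example's variables x, y, z.
R = {(0, 0), (0, 1), (2, 1)}          # R(x, y)
S = {(0, 0), (0, 2), (2, 3)}          # S(x, z)
T = {(0, 2), (1, 0), (1, 2)}          # T(y, z)

# Step 1: materialise the intermediate join R ⋈ S on the shared variable x.
RS = {(x, y, z) for (x, y) in R for (x2, z) in S if x == x2}

# Step 2: join the intermediate result with T on (y, z).
answers = {(x, y, z) for (x, y, z) in RS if (y, z) in T}

print(sorted(RS))       # 5 intermediate tuples
print(sorted(answers))  # [(0, 0, 2), (0, 1, 0), (0, 1, 2)]
```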

Joining like the bad

\(Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\)

R x y
0 0
0 1
2 1
S x z
0 0
0 2
2 3
T y z
0 2
1 0
1 2

What’s wrong with joining like the good?

\[Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\]

  • It is known that if \(|R^\mathbb{D}|, |S^\mathbb{D}|, |T^\mathbb{D}| \leqslant N\), then \(|Q(\mathbb{D})| \leqslant N^{1.5}\).
  • \(R \bowtie S\) may have \(N^2\) answers!
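A one-line justification (not spelled out here) uses the fractional edge cover \(\lambda_R = \lambda_S = \lambda_T = \tfrac{1}{2}\), which covers each of \(x, y, z\) exactly once; the AGM bound discussed below then gives

\[ |Q(\mathbb{D})| \leqslant |R^\mathbb{D}|^{1/2} \cdot |S^\mathbb{D}|^{1/2} \cdot |T^\mathbb{D}|^{1/2} \leqslant N^{1.5} \]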

Worst scenario for query plans

Consider \(\mathbb{D}\) on domain \(D = D_1 \uplus D_2 \uplus D_3\) with:

  • \(0 \notin D\)
  • \(|D_1|=|D_2|=|D_3|=N\).
R x y
0 \(D_2\)
\(D_1\) 0
S x z
0 \(D_3\)
\(D_1\) 0
T y z
0 \(D_3\)
\(D_2\) 0
  • \(|R^\mathbb{D}\bowtie S^\mathbb{D}| \geqslant N^2\), \(|R^\mathbb{D}\bowtie T^\mathbb{D}| \geqslant N^2\), \(|S^\mathbb{D}\bowtie T^\mathbb{D}| \geqslant N^2\)
  • \(|Q(\mathbb{D})| = 0\).

Every query plan must materialise an intermediate table of size at least \(N^2\), yet the answer table never has size greater than \((2N)^{1.5}\).
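This can be checked mechanically; the following illustrative Python snippet (my own, with \(D_1, D_2, D_3\) taken as disjoint integer ranges) builds the instance from the tables above and confirms both claims:

```python
# Sanity check of the worst-case instance: pairwise joins are huge, the triangle join is empty.
N = 100
D1 = range(1, N + 1)                  # 0 is deliberately excluded from the domain
D2 = range(N + 1, 2 * N + 1)
D3 = range(2 * N + 1, 3 * N + 1)

R = {(0, y) for y in D2} | {(x, 0) for x in D1}   # R(x, y)
S = {(0, z) for z in D3} | {(x, 0) for x in D1}   # S(x, z)
T = {(0, z) for z in D3} | {(y, 0) for y in D2}   # T(y, z)

RS = {(x, y, z) for (x, y) in R for (x2, z) in S if x == x2}
triangles = {(x, y, z) for (x, y, z) in RS if (y, z) in T}

print(len(RS) >= N * N)   # True: the intermediate join alone has >= N^2 tuples
print(len(triangles))     # 0: the full query has no answers
```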

Worst-case optimality

Worst-case optimal join

\(Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\)

Ideal complexity: output \(Q(\mathbb{D})\) in time \(\mathcal{O}(f(|Q|) \cdot |Q(\mathbb{D})|)\)

… unlikely to be possible.

Worst case optimal: output \(Q(\mathbb{D})\) in time \(\tilde{\mathcal{O}}(f(|Q|) \cdot N^{1.5})\).

\(N\) is the size of the largest input relation and \(\tilde{\mathcal{O}}(\cdot)\) ignores polylog factors.

\(f(|Q|)\): data complexity, i.e., \(Q\) is considered constant. Ideally, though, \(f\) is a reasonable polynomial.

Worst case value

Consider a join query \(Q\) and all databases for \(Q\) with a bound \(N\) on the table size:

\[ \mathcal{C}[\leqslant N]= \{\mathbb{D}\mid \forall R \in Q, |R^\mathbb{D}| \leqslant N\} \]

and let:

\[ \mathsf{wc}(Q, N) = \mathsf{sup}_{\mathbb{D}\in\mathcal{C}[\leqslant N]}~|Q(\mathbb{D})| \]

\(\mathsf{wc}(Q,N)\) is the worst case: the size of the biggest answer set possible for query \(Q\) over databases where each table is bounded by \(N\).

Worst case examples

  • Cartesian product: \(Q_2 = R_1(x) \wedge R_2(y)\) has \(\mathsf{wc}(Q_2,N) = N^2\).
  • Similarly: \(Q_k = R_1(x_1) \wedge \dots \wedge R_k(x_k)\) has \(\mathsf{wc}(Q_k,N) = N^k\).
  • Square query: \(Q_\square = R(x,y) \wedge R(y,z) \wedge R(z,t) \wedge R(t,x)\) has \(\mathsf{wc}(Q_\square,N)=N^2\).
  • Triangle query: \(Q_\Delta = R(x, y) \wedge S(x, z) \wedge T(y, z)\), \(\mathsf{wc}(Q_\Delta, N) = N^{1.5}\).
  • The \(n\)-cycle: \(Q_{C_n}(x_1,\dots,x_n) = R_1(x_1,x_2) \wedge R_2(x_2,x_3) \wedge \dots \wedge R_n(x_{n},x_1)\): \(\mathsf{wc}(Q_{C_n},N)=N^{\frac{n}{2}}\).

We know how to compute \(\rho(Q)\) such that \(\mathsf{wc}(Q,N) = \tilde{\mathcal{O}}(N^{\rho(Q)})\) but we do not need it!

This is known as the AGM-bound
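For completeness (the algorithm never needs to compute it), \(\rho(Q)\) is the optimal value of the fractional edge cover linear program, which recovers the exponents listed above (\(\lambda_R \equiv 1\) for \(Q_k\), \(\lambda_R \equiv \tfrac{1}{2}\) for the triangle and the \(n\)-cycle):

\[ \rho(Q) = \min\Big\{ \sum_{R \in Q} \lambda_R \;\Big|\; \lambda_R \geqslant 0,\ \forall x \in \mathsf{var}(Q):\ \sum_{R \ni x} \lambda_R \geqslant 1 \Big\} \]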

Worst-Case Optimal Join (WCOJ) algorithms

A join algorithm is worst case optimal (wrt \(\mathcal{C}[\leqslant N]\)) if for every \(Q\), \(N \in \mathbb{N}\) and \(\mathbb{D}\in \mathcal{C}[\leqslant N]\), it computes \(Q(\mathbb{D})\) in time \[\tilde{\mathcal{O}}(f(|Q|) \cdot \mathsf{wc}(Q,N))\]

  • Data complexity model: \(Q\) considered constant hence \(f(|Q|)\) also.
  • In this talk, \(f\) will be a reasonable polynomial!

The DBMS approach is not worst case optimal (triangle example from before).

Existing WCOJ algorithms

Rich literature:

  • NPRR Join (Ngo, Porat, Ré, Rudra, PODS ’12): usual join plans, but with relations partitioned into high/low-degree tuples.
  • Leapfrog Triejoin (Veldhuizen, ICDT ’14)
  • Generic Join (Ngo, PODS ’18): also a branch-and-bound algorithm like ours, but with a more complex analysis and data structures.
  • PANDA (PODS ’17): handles complex database constraints; very involved, long analysis.

We prove the worst case optimality of the branch and bound algorithm in an elementary way.

Analysing the brute

A simple algorithm for joins
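The slide presents the algorithm itself; as a textual stand-in, here is a minimal Python sketch of the branch-and-bound enumeration described in the talk (fixed variable order, one branch per domain value, prune as soon as some relation has no compatible tuple left). Function and variable names are mine, not the paper's.

```python
# Branch-and-bound join: extend a partial assignment variable by variable,
# pruning any branch under which some relation has no compatible tuple left.
def enumerate_join(variables, dom, relations, partial=None):
    """relations: list of (scope, tuples) with scope a tuple of variable names."""
    partial = partial or {}
    if len(partial) == len(variables):
        yield dict(partial)                       # consistent full assignment: one answer
        return
    x = variables[len(partial)]                   # next variable in the fixed order
    for d in dom:
        partial[x] = d
        # keep only the tuples compatible with the extended partial assignment
        restricted = [
            (scope, [t for t in tuples
                     if all(partial.get(v, t[i]) == t[i] for i, v in enumerate(scope))])
            for scope, tuples in relations
        ]
        if all(ts for _, ts in restricted):       # no relation emptied: recurse
            yield from enumerate_join(variables, dom, restricted, partial)
        del partial[x]

# Running example: the triangle query on the tables from the previous slides.
rels = [(("x", "y"), [(0, 0), (0, 1), (2, 1)]),
        (("x", "z"), [(0, 0), (0, 2), (2, 3)]),
        (("y", "z"), [(0, 2), (1, 0), (1, 2)])]
print(list(enumerate_join(("x", "y", "z"), range(4), rels)))
```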

What about the complexity of this algorithm?

Complexity analysis

One recursive call:

  • branch on variable \(x_i\) with a value \(d \in \mathsf{dom}\)
  • filter/project each relation on \(x_i = d\)
  • Binary search in \(\mathcal{O}(\log |R|)\) if \(R\) ordered
    (\(\mathcal{O}(1)\) possible using tries).
R x y
0 0
0 2
1 0
1 1
2 0
2 1

Total complexity: number of recursive calls times \(\tilde{\mathcal{O}}(m)\) where \(m\) is the number of atoms.
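To illustrate the \(\mathcal{O}(\log |R|)\) claim, a small sketch assuming each relation is kept sorted lexicographically in the variable order and that domain values are integers (both assumptions are mine):

```python
import bisect

# R stored sorted lexicographically, branching on its first remaining attribute.
R = sorted([(0, 0), (0, 2), (1, 0), (1, 1), (2, 0), (2, 1)])

def restrict(R, d):
    """Tuples of R whose first attribute equals d, located by two binary searches in O(log |R|)."""
    lo = bisect.bisect_left(R, (d,))        # first tuple >= (d, ...)
    hi = bisect.bisect_left(R, (d + 1,))    # first tuple >= (d + 1, ...); assumes integer values
    return R[lo:hi]                         # the slice is only materialised if the branch survives

print(restrict(R, 1))   # [(1, 0), (1, 1)]
```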

Number of calls: example

  • Nodes: partial assignment \(\tau\)
  • Here: \(\tau := \{x=0, y=1\}\)
  • \(\bot\) node: partial assignment compatible with every relation
  • \(\tau\) solution of \(Q_2(\mathbb{D}_2)\): project on \(x,y\).
  • At most: \(|Q_2(\mathbb{D}_2)|\) such nodes at level \(3\)

\[Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\]

R x y
0 0
0 1
2 1
S x z
0 0
0 2
2 3
T y z
0 2
1 0
1 2

\[Q_2 \coloneq R_2(x, y) \wedge S_2(x) \wedge T_2(y)\]

R\(_2\) x y
0 0
0 1
2 1
S\(_2\) x
0
2
T\(_2\) y
0
1

Number of calls

We get one call for each node of our graph. Two cases arise:

a prefix \(\tau = x_1 \gets d_1, \dots, x_i \gets d_i\) is consistent, that is \(\tau \in \mathsf{ans}(Q_i)\).

\(\leqslant \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls

a prefix \(\tau\) is not consistent. But it comes from a prefix \(\tau'\) that is consistent, and for each \(\tau'\) we have \(|\mathsf{dom}|\) possibilities.

\(\leqslant |\mathsf{dom}| \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls

In total: \(\leqslant (|\mathsf{dom}| + 1) \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls

\[Q_i = \bigwedge_{R\in Q}\Pi_{x_1 \dots x_i} R\]

where \(\Pi_{x_1 \dots x_i} R\) denotes the projection of \(R\) onto its variables among \(x_1, \dots, x_i\).

Algorithm Complexity

The complexity of the branch and bound algorithm is \[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|) \]

Towards Worst-Case Optimality

The Holy Grail of Join Algorithms

For \(Q = \bigwedge_i R_i\), we consider the class of databases \(\mathcal{C}[\leqslant N]\) such that \[ \mathcal{C}[\leqslant N]= \{\mathbb{D}\mid \forall i, |R_i^\mathbb{D}| \leqslant N\} \]

We can then define the worst case as: \[ \mathsf{wc}(Q, N) = \mathsf{sup}_{\mathbb{D}\in\mathcal{C}[\leqslant N]}(|\mathsf{ans}(Q, \mathbb{D})|) \]

Quick, a triangle!

\(Q_\Delta = R(x, y) \wedge S(x, z) \wedge T(y, z)\)

\(\mathsf{wc}(Q_\Delta, N) = \tilde{\mathcal{O}}(N^{3/2})\)

WCO for cardinality constraints

\[\begin{aligned} |\mathsf{ans}(Q_i)| &= \Big|\mathsf{ans}\Big(\bigwedge_{R\in Q}\Pi_{x_1 \dots x_i} R\Big)\Big|\\ &= \Big|\mathsf{ans}\Big(\underbrace{\bigwedge_{R\in Q}\Pi_{x_1 \dots x_i} R \times \{0\}^{X_R\setminus \{x_1, \dots, x_i\}}}_{\in\ \mathcal{C}[\leqslant N]}\Big)\Big| \leqslant \mathsf{wc}(Q, N) \end{aligned}\]

The complexity of the branch-and-bound algorithm is therefore

\[ \tilde{\mathcal{O}}\Big(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\Big) \leqslant \tilde{\mathcal{O}}\big(m|\mathsf{dom}|\cdot n \cdot \mathsf{wc}(Q, N)\big) = \tilde{\mathcal{O}}\big(nm \cdot |\mathsf{dom}| \cdot \mathsf{wc}(Q, N)\big) \]

We do not even need to know \(\mathsf{wc}(Q, N)\) to prove it 🤯

Reducing the domain size

\(\mathsf{R}\) x y
1 2
2 1
3 0

\(\tilde{\mathsf{R}}^b\) \(x^2\) \(x^1\) \(x^0\) \(y^2\) \(y^1\) \(y^0\)
0 0 1 0 1 0
0 1 0 0 0 1
0 1 1 0 0 0

\(b = 3\) bits

  • \(Q \to \tilde{Q}^b\): the binarised query has \(bn\) variables
  • \(\mathbb{D}\to \tilde{\mathbb{D}}^b\) for \(b = \log |\mathsf{dom}|\): the database has roughly the same bitsize, but a domain of size \(2\)!
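A minimal, illustrative sketch of the binarisation step (the encoding below is mine: \(b\) bits per value, MSB first, matching the 6-bit table above):

```python
from math import ceil, log2

def binarise(relation, dom_size):
    """Encode each value of each tuple on b bits; one Boolean column per bit, MSB first."""
    b = max(1, ceil(log2(dom_size)))
    return {tuple(int(bit) for v in t for bit in format(v, f"0{b}b")) for t in relation}

R = {(1, 2), (2, 1), (3, 0)}
print(sorted(binarise(R, 8)))   # b = 3 bits: matches the 6-column table above
```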

WCOJ finally

  • To compute \(Q(\mathbb{D})\), run the simple branch-and-bound algorithm on \((\tilde{Q}^b, \tilde{\mathbb{D}}^b)\):
  • it runs in time \(\tilde{\mathcal{O}}(m \cdot (n\log |\mathsf{dom}|) \cdot 2 \cdot \mathsf{wc}(\tilde{Q}^{b}, N, 2))\),
  • where \(\mathsf{wc}(\tilde{Q}^{b}, N, 2)\) is the worst case for \(\tilde{Q}^b\) on relations of size \(\leqslant N\) over a domain of size \(2\),
  • and \(\mathsf{wc}(\tilde{Q}^{b}, N, 2) \leqslant \mathsf{wc}(Q,N)\), by converting back to the larger domain.

The complexity of the branch-and-bound algorithm thus drops from

\[ \tilde{\mathcal{O}}(nm \cdot |\mathsf{dom}| \cdot \mathsf{wc}(Q, N)) \]

to

\[ \tilde{\mathcal{O}}(nm \cdot \mathsf{wc}(Q, N)) \]
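Putting the pieces together, a hypothetical end-to-end sketch reusing enumerate_join and binarise from the earlier sketches: binarise the relations, enumerate over the Boolean domain, then decode each answer back to the original domain.

```python
from math import ceil, log2
# enumerate_join and binarise are the (hypothetical) helpers from the earlier sketches.

def wcoj(variables, dom_size, relations):
    """Binarise, enumerate over the {0,1} domain, decode answers back to the original domain."""
    b = max(1, ceil(log2(dom_size)))

    def bits(x):
        return [f"{x}^{i}" for i in reversed(range(b))]     # x -> x^{b-1}, ..., x^0 (MSB first)

    bin_vars = tuple(v for x in variables for v in bits(x))
    bin_rels = [(tuple(v for x in scope for v in bits(x)), sorted(binarise(tuples, dom_size)))
                for scope, tuples in relations]
    for tau in enumerate_join(bin_vars, (0, 1), bin_rels):
        yield {x: int("".join(str(tau[v]) for v in bits(x)), 2) for x in variables}
```

On the running example, list(wcoj(("x", "y", "z"), 4, rels)) again yields the three triangle answers.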

Sampling answers uniformly

Problem statement

Given \(Q\) and \(\mathbb{D}\), sample \(\tau \in Q(\mathbb{D})\) with probability \(\frac{1}{|Q(\mathbb{D})|}\) or fail if \(Q(\mathbb{D}) = \emptyset\).

Naive algorithm:

  • materialize \(Q(\mathbb{D})\) in a table
  • sample \(i \leq |Q(\mathbb{D})|\) uniformly
  • output \(Q(\mathbb{D})[i]\).

Complexity using WCOJ:

\(\tilde{\mathcal{O}}(\mathsf{wc}(Q,N) \mathsf{poly}(|Q|))\).

We can do better: (expected) time \(\tilde{\mathcal{O}}(\frac{\mathsf{wc}(Q,N)}{|Q(\mathbb{D})|+1} \mathsf{poly}(|Q|))\)

PODS ’23: [Deng, Lu, Tao] and [Kim, Ha, Fletcher, Han]

Revisiting the problem

Sampling an answer is sampling one of the \(\top\)-leaves of the branching tree

Sampling leaves, the easy way

  • \(\ell(t)\): the number of \(\top\)-leaves below \(t\) is known
  • Recursively sample a \(\top\)-leaf uniformly, descending into child \(t_i\) with probability \(\frac{\ell(t_i)}{\ell(t)}\).
  • A \(\top\)-leaf below \(t_i\) is then sampled with probability \[\frac{1}{\ell(t_i)} \times \frac{\ell(t_i)}{\ell(t)} = \frac{1}{\ell(t)}\] Uniform!

Of course, we do not know \(\ell(t)\)

Sampling leaves with a nice oracle

  • \(\mathsf{upb}(t)\): an upper bound on the number of \(\top\)-leaves below \(t\) is known
  • Recursively sample a \(\top\)-leaf, descending into child \(t_i\) with probability \(\frac{\mathsf{upb}(t_i)}{\mathsf{upb}(t)}\).
  • Fail with probability \(1 - \sum_i \frac{\mathsf{upb}(t_i)}{\mathsf{upb}(t)}\), or upon reaching a \(\bot\)-leaf.

Only makes sense if \(\sum_i \mathsf{upb}(t_i) \leqslant \mathsf{upb}(t)\) (super-additivity).

Las Vegas uniform sampling algorithm:

  • each \(\top\)-leaf is output with probability \(\frac{1}{\mathsf{upb}(t)}\),
  • it fails with probability \(1 - \frac{\ell(t)}{\mathsf{upb}(t)}\), where \(\ell(t)\) is the number of \(\top\)-leaves below \(t\).

Repeat until output: \(\mathcal{O}(\frac{\mathsf{upb}(r)}{\ell(r)})\) expected calls, where \(r\) is the root.
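A minimal sketch of this rejection scheme (names and the node interface — children, upb, is_top_leaf — are mine, not the paper's); it assumes the answer set is non-empty, otherwise an emptiness check is needed before repeating trials.

```python
import random

def sample_leaf(t, children, upb, is_top_leaf):
    """One Las Vegas trial: returns a ⊤-leaf below t with probability 1/upb(t), else None (fail)."""
    while True:
        kids = children(t)
        if not kids:                               # reached a leaf
            # accept a ⊤-leaf with probability 1/upb(t); for the AGM estimator upb is 1 at ⊤-leaves
            return t if is_top_leaf(t) and random.random() * upb(t) < 1 else None
        r = random.random() * upb(t)
        for child in kids:
            if r < upb(child):                     # descend with probability upb(child)/upb(t)
                t = child
                break
            r -= upb(child)
        else:
            return None                            # residual mass 1 - Σ upb(child)/upb(t): fail

def sample_uniform(root, children, upb, is_top_leaf):
    """Repeat trials until one succeeds: upb(root)/ℓ(root) expected calls."""
    while True:
        leaf = sample_leaf(root, children, upb, is_top_leaf)
        if leaf is not None:
            return leaf
```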

Upper bound oracles for conjunctive queries

  • Node \(t\): partial assignment \(\tau_t := (x_1=d_1, \dots, x_i=d_i)\)
  • Number of \(\top\)-leaves below \(t\): \(|Q(\mathbb{D})[\tau_t]|\).
  • \(\mathsf{upb}(t)\)? \(\rightarrow\) look for worst-case bounds!

AGM bound: there exist positive rational numbers \((\lambda_R)_{R \in Q}\) such that \[|Q(\mathbb{D})| \leq \prod_{R \in Q}|R^\mathbb{D}|^{\lambda_R} \leqslant \mathsf{wc}(Q,N)\]

Define \(\mathsf{upb}(t) = \prod_{R \in Q}|R^\mathbb{D}[\tau_t]|^{\lambda_R} \leq \mathsf{wc}(Q,N)\):

  • it is an upper bound on \(|Q(\mathbb{D})[\tau_t]|\),
  • it is super-additive: \(\mathsf{upb}(t) \geqslant \sum_{d \in \mathsf{dom}} \mathsf{upb}(t_d)\)
  • value of \(\mathsf{upb}\) at the root of the tree: \(\mathsf{wc}(Q,N)\)!
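For the triangle query, \(\lambda_R = \lambda_S = \lambda_T = \tfrac{1}{2}\) and the oracle is a product of square roots of restricted relation sizes; a small illustrative sketch (names are mine) that can play the role of \(\mathsf{upb}\) in the sampler sketched earlier:

```python
from math import sqrt

def upb_triangle(R, S, T, tau):
    """AGM-based estimator for Q_Δ at partial assignment tau, with λ = 1/2 on each atom."""
    def restricted_size(rel, scope):
        return sum(1 for t in rel
                   if all(tau.get(v, t[i]) == t[i] for i, v in enumerate(scope)))
    return (sqrt(restricted_size(R, ("x", "y")))
            * sqrt(restricted_size(S, ("x", "z")))
            * sqrt(restricted_size(T, ("y", "z"))))

R = {(0, 0), (0, 1), (2, 1)}
S = {(0, 0), (0, 2), (2, 3)}
T = {(0, 2), (1, 0), (1, 2)}
print(upb_triangle(R, S, T, {}))            # root: 3^1.5 ≈ 5.2, the AGM bound for N = 3
print(upb_triangle(R, S, T, {"x": 0}))      # after branching x = 0: sqrt(2)·sqrt(2)·sqrt(3) ≈ 3.46
```

Its value at the root (empty assignment) is exactly \(\sqrt{|R||S||T|}\), the AGM bound for the instance.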

Sampling from CQs

Let \(T\) be a tree rooted at \(r\), \(\mathsf{upb}\) a super-additive leaf estimator, and \(\mathsf{out}\) the output of our algorithm. Then the algorithm is a uniform Las Vegas sampler: for any \(\top\)-leaf \(l\), \[ \mathsf{Pr}(\mathsf{out} = l) = \frac{1}{\mathsf{upb}(T)} \qquad \mathsf{Pr}(\mathsf{out} = \mathsf{fail}) = 1 - \frac{|\top\mathsf{-leaves}(T)|}{\mathsf{upb}(T)} \]

Sampling complexity

Given the class of databases \(\mathcal{C}[\leqslant N]\), for any query \(Q\) and any \(\mathbb{D}\in \mathcal{C}[\leqslant N]\), it is possible to uniformly sample from the answer set in expected time \[ \tilde{\mathcal{O}}(\frac{\mathsf{wc}(Q, N)}{\mathsf{max}(1, |\mathsf{ans}(Q)|)} \cdot nm \cdot \mathsf{log}|\mathsf{dom}|) \]

Matches existing complexity results for uniform sampling

Conclusion

We have a simple algorithm for the worst case optimal join

We can sample uniformly from the answer set of the queries

We have presented the work for classes of databases defined by cardinality constraints \(\mathcal{C}[\leqslant N]\), but these algorithms also work for classes defined by acyclic degree constraints:

  • if we have an order on the variables that is compatible;
  • and by using the polymatroid bound as the sampling upper-bound estimator.