Florent Capelli[1], Oliver Irwin[2], Sylvain Salvati[2]
23/05/2025 - SISE thematic group seminar
[1] - CRIL / Université d’Artois
[2] - CRIStAL / Université de Lille
The good, the bad (and not the ugly)
\(Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\)
\(R\)

x | y |
---|---|
0 | 0 |
0 | 1 |
2 | 1 |

\(S\)

x | z |
---|---|
0 | 0 |
0 | 2 |
2 | 3 |

\(T\)

y | z |
---|---|
0 | 2 |
1 | 0 |
1 | 2 |
\((R \bowtie S) \bowtie T\)
\(R \bowtie S\)

x | y | z |
---|---|---|
0 | 0 | 0 |
0 | 0 | 2 |
0 | 1 | 0 |
0 | 1 | 2 |
2 | 1 | 3 |
\(R \bowtie S \bowtie T\)

x | y | z |
---|---|---|
0 | 0 | 2 |
0 | 1 | 0 |
0 | 1 | 2 |
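The pairwise plan \((R \bowtie S) \bowtie T\) can be replayed on the example tables. This is a minimal sketch, assuming relations are stored as Python sets of tuples; the names `r_join_s` and `answers` are illustrative only.

```python
# Relations of the running example, stored as sets of tuples.
R = {(0, 0), (0, 1), (2, 1)}   # R(x, y)
S = {(0, 0), (0, 2), (2, 3)}   # S(x, z)
T = {(0, 2), (1, 0), (1, 2)}   # T(y, z)

# Step 1: materialise R join S by matching on the shared variable x.
r_join_s = {(x, y, z) for (x, y) in R for (x2, z) in S if x == x2}

# Step 2: keep only the rows whose (y, z) part appears in T.
answers = {(x, y, z) for (x, y, z) in r_join_s if (y, z) in T}

print(len(r_join_s), sorted(answers))
```

The intermediate table has 5 rows while the final answer has only 3, which is the gap the next slides exploit.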
\[Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\]
Consider \(\mathbb{D}\) on domain \(D = D_1 \uplus D_2 \uplus D_3\), where a row such as \(0, D_2\) stands for all tuples \((0, d)\) with \(d \in D_2\):

\(R\)

x | y |
---|---|
0 | \(D_2\) |
\(D_1\) | 0 |

\(S\)

x | z |
---|---|
0 | \(D_3\) |
\(D_1\) | 0 |

\(T\)

y | z |
---|---|
0 | \(D_3\) |
\(D_2\) | 0 |
Every pairwise query plan materialises an intermediate table of size \(\Omega(N^2)\) (taking \(|D_i| = N\), so that each relation has about \(2N\) tuples), but the answer table never has more than \((2N)^{1.5}\) tuples.
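The blow-up can be checked on a small instance. The sketch below assumes \(|D_1| = |D_2| = |D_3| = n\) and that \(0\) belongs to none of the \(D_i\); both choices, and the function names, are assumptions made for illustration.

```python
def bad_instance(n):
    """Build R, S, T following the slide, assuming |D_1| = |D_2| = |D_3| = n
    and that 0 is in none of the D_i (assumptions of this sketch)."""
    d1 = range(1, n + 1)
    d2 = range(n + 1, 2 * n + 1)
    d3 = range(2 * n + 1, 3 * n + 1)
    r = {(0, b) for b in d2} | {(a, 0) for a in d1}   # R(x, y)
    s = {(0, c) for c in d3} | {(a, 0) for a in d1}   # S(x, z)
    t = {(0, c) for c in d3} | {(b, 0) for b in d2}   # T(y, z)
    return r, s, t


def plan_sizes(n):
    """Sizes of the three possible first joins, then of the final answer."""
    r, s, t = bad_instance(n)
    rs = {(x, y, z) for (x, y) in r for (x2, z) in s if x == x2}
    rt = {(x, y, z) for (x, y) in r for (y2, z) in t if y == y2}
    st = {(x, y, z) for (x, z) in s for (y, z2) in t if z == z2}
    answers = {(x, y, z) for (x, y, z) in rs if (y, z) in t}
    return len(rs), len(rt), len(st), len(answers)


print(plan_sizes(20))
```

With these assumptions every first join has \(n^2 + n\) rows, whichever pair of relations the plan starts from, while the answer set is empty.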
\(Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\)
Ideal complexity: output \(Q(\mathbb{D})\) in time \(\mathcal{O}(f(|Q|) \cdot |Q(\mathbb{D})|)\)…
… unlikely to be possible.
Worst case optimal: output \(Q(\mathbb{D})\) in time \(\tilde{\mathcal{O}}(f(|Q|) \cdot N^{1.5})\).
\(N\) is the size of the largest input relation and \(\tilde{\mathcal{O}}(\cdot)\) ignores polylog factors.
\(f(|Q|)\): we work in data complexity, i.e., \(Q\) is considered constant. Ideally, though, \(f\) is a reasonable polynomial.
Consider a join query \(Q\) and all databases for \(Q\) with a bound \(N\) on the table size:
\[ \mathcal{C}[\leqslant N]= \{\mathbb{D}\mid \forall R \in Q, |R^\mathbb{D}| \leqslant N\} \]
and let:
\[ \mathsf{wc}(Q, N) = \mathsf{sup}_{\mathbb{D}\in\mathcal{C}[\leqslant N]}~|Q(\mathbb{D})| \]
\(\mathsf{wc}(Q,N)\) is the worst case: the size of the largest possible answer set for query \(Q\) over databases where each table has at most \(N\) tuples.
We know how to compute \(\rho(Q)\) such that \(\mathsf{wc}(Q,N) = \tilde{\mathcal{O}}(N^{\rho(Q)})\) but we do not need it!
This is known as the AGM bound.
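As a worked instance for the triangle query, assuming (the slide does not spell this out) that \(\rho(Q)\) is the optimum of the standard fractional edge cover linear program, with one weight \(\lambda_R\) per atom and one cover constraint per variable:

\[\begin{align} \text{minimise } \quad & \lambda_R + \lambda_S + \lambda_T \\ \text{subject to } \quad & \lambda_R + \lambda_S \geqslant 1 \quad (x) \\ & \lambda_R + \lambda_T \geqslant 1 \quad (y) \\ & \lambda_S + \lambda_T \geqslant 1 \quad (z) \\ & \lambda_R, \lambda_S, \lambda_T \geqslant 0 \end{align}\]

Summing the three cover constraints gives \(2(\lambda_R + \lambda_S + \lambda_T) \geqslant 3\), and \(\lambda_R = \lambda_S = \lambda_T = \frac{1}{2}\) attains it, so \(\rho(Q) = \frac{3}{2}\), matching the \(N^{1.5}\) exponent used for the triangle query.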
A join algorithm is worst case optimal (wrt \(\mathcal{C}[\leqslant N]\)) if for every \(Q\), \(N \in \mathbb{N}\) and \(\mathbb{D}\in \mathcal{C}[\leqslant N]\), it computes \(Q(\mathbb{D})\) in time \[\tilde{\mathcal{O}}(f(|Q|) \cdot \mathsf{wc}(Q,N))\]
The DBMS approach is not worst case optimal (triangle example from before).
Rich literature:
We prove the worst case optimality of the branch and bound algorithm in an elementary way.
What about the complexity of this algorithm?
One recursive call:
\(R\)

x | y |
---|---|
0 | 0 |
0 | 2 |
1 | 0 |
1 | 1 |
2 | 0 |
2 | 1 |
Total complexity: number of recursive calls times \(\tilde{\mathcal{O}}(m)\) where \(m\) is the number of atoms.
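The recursion can be sketched as follows. This is one plausible reading of the branch and bound algorithm, not necessarily the authors' exact formulation: variables are assigned in a fixed order, and a prefix is pruned as soon as some atom has no tuple agreeing with it. The encoding of atoms as `(variables, tuples)` pairs is hypothetical, and the consistency check is a linear scan for brevity, whereas the \(\tilde{\mathcal{O}}(m)\) cost per call presumably relies on an index such as a sorted relation or a trie.

```python
def branch_and_bound(variables, dom, atoms):
    """Enumerate the answers of a join query by extending a prefix
    one variable at a time.

    atoms: list of (atom_vars, tuples) pairs (hypothetical encoding).
    """
    def consistent(prefix):
        # Keep the prefix only if every atom has a tuple that agrees
        # with it on the variables assigned so far.
        return all(
            any(all(prefix.get(v, t[k]) == t[k] for k, v in enumerate(atom_vars))
                for t in tuples)
            for atom_vars, tuples in atoms
        )

    def extend(prefix, i):
        if not consistent(prefix):
            return                      # bound: prune this branch
        if i == len(variables):
            yield tuple(prefix[v] for v in variables)
            return
        for d in dom:                   # branch on the next variable
            yield from extend({**prefix, variables[i]: d}, i + 1)

    yield from extend({}, 0)


R = {(0, 0), (0, 1), (2, 1)}
S = {(0, 0), (0, 2), (2, 3)}
T = {(0, 2), (1, 0), (1, 2)}
atoms = [(("x", "y"), R), (("x", "z"), S), (("y", "z"), T)]
answers = set(branch_and_bound(("x", "y", "z"), range(4), atoms))
print(sorted(answers))
```

On the running example this returns the same three tuples as the pairwise plan, without materialising \(R \bowtie S\).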
\[Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\]
\(R\)

x | y |
---|---|
0 | 0 |
0 | 1 |
2 | 1 |

\(S\)

x | z |
---|---|
0 | 0 |
0 | 2 |
2 | 3 |

\(T\)

y | z |
---|---|
0 | 2 |
1 | 0 |
1 | 2 |
\[Q_2 \coloneq R_2(x, y) \wedge S_2(x) \wedge T_2(y)\]
\(R_2\)

x | y |
---|---|
0 | 0 |
0 | 1 |
2 | 1 |

\(S_2\)

x |
---|
0 |
2 |

\(T_2\)

y |
---|
0 |
1 |
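The tables of \(Q_2\) are obtained by projecting each relation onto the first two variables. A minimal sketch, assuming the same hypothetical `(variables, tuples)` encoding of atoms; `project` and `prefix_query` are illustrative names.

```python
def project(atom_vars, tuples, keep):
    """Project a relation onto those of its variables that are in `keep`."""
    idx = [k for k, v in enumerate(atom_vars) if v in keep]
    return tuple(atom_vars[k] for k in idx), {tuple(t[k] for k in idx) for t in tuples}


def prefix_query(variables, atoms, i):
    """Atoms of Q_i: every relation projected onto the first i variables."""
    keep = set(variables[:i])
    return [project(atom_vars, tuples, keep) for atom_vars, tuples in atoms]


R = {(0, 0), (0, 1), (2, 1)}
S = {(0, 0), (0, 2), (2, 3)}
T = {(0, 2), (1, 0), (1, 2)}
atoms = [(("x", "y"), R), (("x", "z"), S), (("y", "z"), T)]

q2 = prefix_query(("x", "y", "z"), atoms, 2)
for atom_vars, tuples in q2:
    print(atom_vars, sorted(tuples))
```

On the example, \(R\) is unchanged, \(S\) collapses to its \(x\) column and \(T\) to its \(y\) column, as in the tables above.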
We get one call for each node of our graph. Two cases arise:
Case 1: the prefix \(\tau = x_1\gets d_1, \dots, x_i \gets d_i\) is consistent, that is, \(\tau \in \mathsf{ans}(Q_i)\). This accounts for \(\leqslant \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls.
Case 2: the prefix \(\tau\) is not consistent. It then extends a prefix \(\tau'\) that is consistent, and each \(\tau'\) has at most \(|\mathsf{dom}|\) extensions. This accounts for \(\leqslant |\mathsf{dom}| \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls.
In total: \(\leqslant (|\mathsf{dom}| + 1) \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls.
Here \(\prod_{x_1\dots x_i} R\) denotes the projection of \(R\) onto \(x_1, \dots, x_i\), and \[Q_i = \bigwedge_{R\in Q}\prod_{x_1\dots x_i} R\]
The complexity of the branch and bound algorithm is \[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|) \]
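The call count can be checked against this bound on the running example. This is a brute-force sketch under two assumptions: every explored prefix, including an inconsistent one that returns immediately, counts as one call, and the sum includes the empty prefix (\(i = 0\)). All names are illustrative.

```python
from itertools import product

R = {(0, 0), (0, 1), (2, 1)}
S = {(0, 0), (0, 2), (2, 3)}
T = {(0, 2), (1, 0), (1, 2)}
VARIABLES = ("x", "y", "z")
ATOMS = [(("x", "y"), R), (("x", "z"), S), (("y", "z"), T)]
DOM = range(4)


def consistent(prefix):
    """True if every atom has a tuple agreeing with the prefix,
    i.e. the prefix is an answer of the corresponding Q_i."""
    return all(
        any(all(prefix.get(v, t[k]) == t[k] for k, v in enumerate(atom_vars))
            for t in tuples)
        for atom_vars, tuples in ATOMS
    )


def count_calls(prefix=None, i=0):
    """Number of recursive calls made when exploring from `prefix`."""
    prefix = prefix or {}
    if not consistent(prefix) or i == len(VARIABLES):
        return 1
    return 1 + sum(count_calls({**prefix, VARIABLES[i]: d}, i + 1) for d in DOM)


def ans_size(i):
    """|ans(Q_i)|, computed by brute force over all prefixes of length i."""
    return sum(
        consistent(dict(zip(VARIABLES[:i], values)))
        for values in product(DOM, repeat=i)
    )


calls = count_calls()
bound = (len(DOM) + 1) * sum(ans_size(i) for i in range(len(VARIABLES) + 1))
print(calls, bound)
```

Here the prefix queries have 1, 2, 3 and 3 answers for \(i = 0, \dots, 3\), so the bound is \(5 \cdot 9 = 45\), against 25 calls actually made.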
For \(Q = \bigwedge_i R_i\), we consider the class of databases \(\mathcal{C}[\leqslant N]\) defined by \[ \mathcal{C}[\leqslant N]= \{\mathbb{D}\mid \forall i, |R_i^\mathbb{D}| \leqslant N\} \]
We can then define the worst case as: \[ \mathsf{wc}(Q, N) = \mathsf{sup}_{\mathbb{D}\in\mathcal{C}[\leqslant N]}(|\mathsf{ans}(Q, \mathbb{D})|) \]
\(Q_\Delta = R(x, y) \wedge S(x, z) \wedge T(y, z)\)
\(\mathsf{wc}(Q_\Delta, N) = \tilde{\mathcal{O}}(N^{3/2})\)
\[\begin{align} |\mathsf{ans}(Q_i)| &= \Big|\mathsf{ans}\Big(\bigwedge_{R\in Q}\prod_{x_1\dots x_i} R\Big)\Big|\\ &= \Big|\mathsf{ans}\Big(\underbrace{\bigwedge_{R\in Q}\prod_{x_1\dots x_i} R \times \{0\}^{X_R\setminus \{x_1\dots x_i\}}}_{\in\ \mathcal{C}[\leqslant N]}\Big)\Big| \leqslant \mathsf{wc}(Q, N) \end{align}\]
The complexity of the branch and bound algorithm is
\[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|) \]
Since each \(|\mathsf{ans}(Q_i)|\) is at most \(\mathsf{wc}(Q, N)\), this is bounded by
\[ \tilde{\mathcal{O}}(nm \cdot |\mathsf{dom}| \cdot \mathsf{wc}(Q, N)) \]
We do not even need to know \(\mathsf{wc}(Q, N)\) to prove it 🤯
\(\mathsf{R}\)

x | y |
---|---|
1 | 2 |
2 | 1 |
3 | 0 |

⇝

\(\tilde{\mathsf{R}}^b\)

\(x^2\) | \(x^1\) | \(x^0\) | \(y^2\) | \(y^1\) | \(y^0\) |
---|---|---|---|---|---|
0 | 0 | 1 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 1 |
0 | 1 | 1 | 0 | 0 | 0 |
\(b = 3\) bits
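The encoding above can be reproduced with a few lines. A minimal sketch, assuming values are non-negative integers written most significant bit first; `to_bits` and `binarise` are illustrative names.

```python
def to_bits(value, b):
    """Encode `value` on b bits, most significant bit first."""
    return tuple((value >> k) & 1 for k in reversed(range(b)))


def binarise(tuples, b):
    """Replace every column of a relation by b Boolean columns."""
    return {tuple(bit for v in t for bit in to_bits(v, b)) for t in tuples}


R = {(1, 2), (2, 1), (3, 0)}
for row in sorted(binarise(R, 3)):
    print(row)
```

Each variable becomes \(b\) Boolean variables, so the domain the algorithm branches on shrinks to \(\{0, 1\}\).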
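The complexity of the branch and bound algorithm is
\[ \tilde{\mathcal{O}}(nm \cdot |\mathsf{dom}| \cdot \mathsf{wc}(Q, N)) \]
On the binary encoding the domain is \(\{0, 1\}\), so the \(|\mathsf{dom}|\) factor is replaced by a \(\log|\mathsf{dom}|\) factor, which \(\tilde{\mathcal{O}}\) absorbs:
\[ \tilde{\mathcal{O}}(nm \cdot \mathsf{wc}(Q, N)) \]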
Given \(Q\) and \(\mathbb{D}\), sample \(\tau \in Q(\mathbb{D})\) with probability \(\frac{1}{|Q(\mathbb{D})|}\) or fail if \(Q(\mathbb{D}) = \emptyset\).
Naive algorithm:
Complexity using WCOJ:
\(\tilde{\mathcal{O}}(\mathsf{wc}(Q,N) \mathsf{poly}(|Q|))\).
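A naive sampler along these lines materialises the whole answer set and then draws from it. This is a sketch of one obvious reading, with a plain nested-loop join standing in for a worst case optimal join; `naive_sample` is an illustrative name.

```python
import random

R = {(0, 0), (0, 1), (2, 1)}
S = {(0, 0), (0, 2), (2, 3)}
T = {(0, 2), (1, 0), (1, 2)}

# Materialise Q(D) in full (stand-in for a worst case optimal join).
answers = {
    (x, y, z)
    for (x, y) in R
    for (x2, z) in S
    if x == x2 and (y, z) in T
}


def naive_sample(answer_set, rng=random):
    """Uniform sample from a materialised answer set, or None if it is empty."""
    ordered = sorted(answer_set)
    return rng.choice(ordered) if ordered else None


print(naive_sample(answers))
```

The cost is dominated by the materialisation step, even when a single sample is requested.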
We can do better: (expected) time \(\tilde{\mathcal{O}}(\frac{\mathsf{wc}(Q,N)}{|Q(\mathbb{D})|+1} \mathsf{poly}(|Q|))\)
PODS ’23: [Deng, Lu, Tao] and [Kim, Ha, Fletcher, Han]
Sampling an answer is sampling one of the leaves
Of course, we do not know \(\ell(t)\)…
This only makes sense if \(\sum_i \mathsf{upb}(t_i) \leqslant \mathsf{upb}(t)\).
Las Vegas uniform sampling algorithm:
Repeat until output: \(\mathcal{O}(\frac{\mathsf{upb}(r)}{\ell(r)})\) expected calls, where \(r\) is the root.
AGM bound: there exist positive rational numbers \((\lambda_R)_{R \in Q}\) such that \[|Q(\mathbb{D})| \leqslant \prod_{R \in Q}|R^\mathbb{D}|^{\lambda_R} \leqslant \mathsf{wc}(Q,N)\]
Define \(\mathsf{upb}(t) = \prod_{R \in Q}|{\color{red}R^\mathbb{D}[\tau_t]}|^{\lambda_R} \leq \mathsf{wc}(Q,N)\):
Let \(T\) be a tree rooted in \(r\), \(\mathsf{upb}\) a super-additive leaf estimator and \(\mathsf{out}\) the output of our algorithm. Then the algorithm is a uniform Las Vegas sampler: for any \(\top\)-leaf \(l\), \[ \mathsf{Pr}(\mathsf{out} = l) = \frac{1}{\mathsf{upb}(T)} \qquad \mathsf{Pr}(\mathsf{out} = \mathsf{fail}) = 1 - \frac{|\top\mathsf{-leaves}(T)|}{\mathsf{upb}(T)} \]
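One way to realise such a sampler on the triangle example is sketched below. It is a reading of the guarantee above, not the authors' code: at each node the walk moves to child \(t_i\) with probability \(\mathsf{upb}(t_i)/\mathsf{upb}(t)\) and fails with the remaining probability. The weights \(\lambda_R = \frac{1}{2}\) are assumed for the triangle query, and all function names are illustrative.

```python
import random

R = {(0, 0), (0, 1), (2, 1)}
S = {(0, 0), (0, 2), (2, 3)}
T = {(0, 2), (1, 0), (1, 2)}
VARIABLES = ("x", "y", "z")
ATOMS = [(("x", "y"), R), (("x", "z"), S), (("y", "z"), T)]
DOM = range(4)
LAMBDA = 0.5  # assumed weight of every atom of the triangle query


def upb(prefix):
    """Product over atoms of |R[prefix]|^lambda, where R[prefix] is the
    set of tuples of R agreeing with the partial assignment."""
    result = 1.0
    for atom_vars, tuples in ATOMS:
        matching = sum(
            all(prefix.get(v, t[k]) == t[k] for k, v in enumerate(atom_vars))
            for t in tuples
        )
        result *= matching ** LAMBDA
    return result


def sample_once(rng):
    """One attempt: returns an answer tuple, or None on failure."""
    prefix = {}
    for var in VARIABLES:
        threshold = rng.random() * upb(prefix)
        acc = 0.0
        for d in DOM:
            child = {**prefix, var: d}
            acc += upb(child)
            if threshold < acc:
                prefix = child
                break
        else:
            return None  # landed in the slack upb(t) - sum_i upb(t_i)
    return tuple(prefix[v] for v in VARIABLES)


def sample(rng, max_tries=10_000):
    """Repeat attempts until one succeeds (Las Vegas), up to max_tries."""
    for _ in range(max_tries):
        out = sample_once(rng)
        if out is not None:
            return out
    return None


print(sample(random.Random(0)))
```

On this instance \(\mathsf{upb}\) at the root is \(\sqrt{27} \approx 5.2\) and there are 3 answers, so an attempt succeeds with probability about 0.58 and each answer is returned with probability \(1/\sqrt{27}\).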
Given a class of queries \(\mathcal{C}[\leqslant N]\), for any query \(Q \in \mathcal{C}[\leqslant N]\), it is possible to uniformly sample from the answer set with expected time \[ \tilde{\mathcal{O}}(\frac{\mathsf{wc}(Q, N)}{\mathsf{max}(1, |\mathsf{ans}(Q)|)} \cdot nm \cdot \mathsf{log}|\mathsf{dom}|) \]
Matches existing complexity results for uniform sampling
We have a simple algorithm for the worst case optimal join
We can sample uniformly from the answer set of the queries
We have presented the work for classes of queries defined by cardinality constraints \(\mathcal{C}[\leqslant N]\), but these algorithms also work for classes of queries defined by acyclic degree constraints: