Florent Capelli[1], Oliver Irwin[2], Sylvain Salvati[2]
25/03/2025 - ICDT 2025
[1] - CRIL / Université d’Artois
[2] - CRIStAL / Université de Lille
\(Q \coloneq R(x, y) \wedge S(x, z) \wedge T(y, z)\)
| R | x | y |
|---|---|---|
| | 0 | 0 |
| | 0 | 1 |
| | 2 | 1 |
| S | x | z |
|---|---|---|
| | 0 | 0 |
| | 0 | 2 |
| | 2 | 3 |
| T | y | z |
|---|---|---|
| | 0 | 2 |
| | 1 | 0 |
| | 1 | 2 |
What about the complexity of this algorithm?
Now we have to bound how many calls we make
We get one call for each node of our graph. Two cases arise:
a prefix \(\tau = x_1\gets d_1, \dots, x_i \gets d_i\) is consistent, that is \(\tau \in \mathsf{ans}(Q_i)\).
\(\sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls
a prefix \(\tau\) is not consistent. It is then obtained by extending a consistent prefix \(\tau'\) by one variable, and each \(\tau'\) has at most \(|\mathsf{dom}|\) such extensions.
\(|\mathsf{dom}| \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls
Summing both cases: at most \((|\mathsf{dom}| + 1) \cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|\) calls
\[Q_i = \bigwedge_{R\in Q}\prod_{x_1\dots x_i} R\]
The complexity of the branch and bound algorithm is \[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|) \]
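The call count can be checked on the triangle example. The following is a minimal sketch, not the authors' implementation: the function name `branch_and_bound_join`, the `stats` counters, the variable order \(x, y, z\) and the domain \(\{0,1,2,3\}\) are all assumptions made for illustration, and the sum over \(i\) is taken to include the empty prefix (\(i = 0\)).

```python
def branch_and_bound_join(variables, relations, domain):
    """Enumerate answers of a join query by extending prefixes one variable at a time.

    relations: list of (relation_variables, set_of_tuples) pairs.
    Returns (answers, stats), where stats counts recursive calls and
    consistent prefixes. All names here are hypothetical.
    """
    answers = []
    stats = {"calls": 0, "consistent": 0}

    def consistent(assignment):
        # A prefix is consistent when every relation has a tuple that
        # agrees with it on the variables assigned so far.
        for rel_vars, tuples in relations:
            if not any(
                all(assignment.get(v, t[k]) == t[k] for k, v in enumerate(rel_vars))
                for t in tuples
            ):
                return False
        return True

    def explore(assignment, i):
        stats["calls"] += 1
        if not consistent(assignment):
            return  # inconsistent prefix: prune this branch
        stats["consistent"] += 1
        if i == len(variables):
            answers.append(tuple(assignment[v] for v in variables))
            return
        for d in domain:
            assignment[variables[i]] = d
            explore(assignment, i + 1)
            del assignment[variables[i]]

    explore({}, 0)
    return answers, stats


# The database from the example tables above.
R = {(0, 0), (0, 1), (2, 1)}
S = {(0, 0), (0, 2), (2, 3)}
T = {(0, 2), (1, 0), (1, 2)}
RELATIONS = [(("x", "y"), R), (("x", "z"), S), (("y", "z"), T)]
DOMAIN = range(4)  # assumed domain {0, 1, 2, 3}

answers, stats = branch_and_bound_join(("x", "y", "z"), RELATIONS, DOMAIN)
print(answers)  # the three answers of the triangle query
print(stats)    # 25 calls, 9 of them on consistent prefixes
```

On this instance the 25 calls stay below \((|\mathsf{dom}| + 1) \cdot 9 = 45\), in line with the counting argument.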
For \(Q = \bigwedge_i R_i\), we consider the class of databases \(\mathcal{C}[\leqslant N]\) in which every relation has at most \(N\) tuples: \[ \mathcal{C}[\leqslant N]= \{\mathbb{D}\mid \forall i, |R_i^\mathbb{D}| \leqslant N\} \]
We can then define the worst case as: \[ \mathsf{wc}(Q, N) = \mathsf{sup}_{\mathbb{D}\in\mathcal{C}[\leqslant N]}(|\mathsf{ans}(Q, \mathbb{D})|) \]
\(Q_\Delta = R(x, y) \wedge S(x, z) \wedge T(y, z)\)
\(\mathsf{wc}(Q_\Delta, N) = \tilde{\mathcal{O}}(N^{3/2})\)
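To see that the \(N^{3/2}\) bound for \(Q_\Delta\) is reached, one can take \(R = S = T = [k]\times[k]\), so that \(N = k^2\) and every triple in \([k]^3\) is an answer. The sketch below checks this by brute force; the helper names `grid_database` and `count_triangle_answers` are hypothetical.

```python
def grid_database(k):
    """Return (R, S, T), each equal to the full k-by-k grid (N = k * k tuples)."""
    grid = {(a, b) for a in range(k) for b in range(k)}
    return grid, grid, grid


def count_triangle_answers(R, S, T):
    """Count triples (x, y, z) with R(x, y), S(x, z) and T(y, z), by brute force."""
    z_values = {z for _, z in S}
    return sum(
        1
        for (x, y) in R
        for z in z_values
        if (x, z) in S and (y, z) in T
    )


k = 4
R_grid, S_grid, T_grid = grid_database(k)
N = len(R_grid)
print(N, count_triangle_answers(R_grid, S_grid, T_grid))  # 16 and 64 = 16 ** 1.5
```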
\[\begin{aligned} |\mathsf{ans}(Q_i)| &= \Bigl|\mathsf{ans}\Bigl(\bigwedge_{R\in Q}\prod_{x_1\dots x_i} R\Bigr)\Bigr|\\ &= \Bigl|\mathsf{ans}\Bigl(\underbrace{\bigwedge_{R\in Q}\prod_{x_1\dots x_i} R \times \{0\}^{X_R\setminus \{x_1\dots x_i\}}}_{\in\ \mathcal{C}[\leqslant N]}\Bigr)\Bigr| \leqslant \mathsf{wc}(Q, N) \end{aligned}\]
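The complexity of the branch and bound algorithm is
\[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot \sum_{i\leqslant n}|\mathsf{ans}(Q_i)|) \]
Since \(|\mathsf{ans}(Q_i)| \leqslant \mathsf{wc}(Q, N)\) for each of the \(n\) prefixes, this is at most
\[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot n \cdot \mathsf{wc}(Q, N)) \]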
Now we remove the \(|\mathsf{dom}|\) factor
| \(\mathsf{R}\) | \(x\) | \(y\) |
|---|---|---|
| | 1 | 2 |
| | 2 | 1 |
| | 3 | 0 |
⇝
| \(\tilde{\mathsf{R}}^b\) | \(x^2\) | \(x^1\) | \(x^0\) | \(y^2\) | \(y^1\) | \(y^0\) |
|---|---|---|---|---|---|---|
| | 0 | 0 | 1 | 0 | 1 | 0 |
| | 0 | 1 | 0 | 0 | 0 | 1 |
| | 0 | 1 | 1 | 0 | 0 | 0 |
\(b = 3\) bits
\(Q\) ⇝ \(\tilde{Q}^b\)
Domain of size 2
\(b = \mathsf{log}|\mathsf{dom}|\implies n\cdot b\) variables
\(\mathsf{wc}(Q, N) = \mathsf{wc}(\tilde{Q}^b, N)\)
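The binary encoding can be sketched as follows, assuming each value is written on \(b\) bits with the most significant bit first, as in the table for \(\tilde{\mathsf{R}}^b\). The function names are hypothetical. Since the encoding is a bijection on tuples, relation sizes and answer counts are unchanged.

```python
def encode_value(value, bits):
    """Write a non-negative integer on `bits` bits, most significant bit first."""
    return tuple((value >> k) & 1 for k in reversed(range(bits)))


def encode_relation(tuples, bits):
    """Replace every column by `bits` Boolean columns."""
    return [
        tuple(bit for value in row for bit in encode_value(value, bits))
        for row in tuples
    ]


def decode_row(row, bits):
    """Inverse of the encoding on one row: regroup the bits into integers."""
    values = []
    for start in range(0, len(row), bits):
        value = 0
        for bit in row[start:start + bits]:
            value = (value << 1) | bit
        values.append(value)
    return tuple(values)


R_rows = [(1, 2), (2, 1), (3, 0)]
R_encoded = encode_relation(R_rows, bits=3)
for row in R_encoded:
    print(row)
```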
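The complexity of the branch and bound algorithm is
\[ \tilde{\mathcal{O}}(m|\mathsf{dom}|\cdot n \cdot \mathsf{wc}(Q, N)) \]
On the binarised query the domain has size 2, and the \(\mathsf{log}|\mathsf{dom}|\) factor coming from the \(n\cdot b\) variables is absorbed by \(\tilde{\mathcal{O}}\):
\[ \tilde{\mathcal{O}}(mn \cdot \mathsf{wc}(Q, N)) \]
The algorithm is worst-case optimal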
Sampling an answer is sampling one of the leaves
This is easy if we know how many leaves we have down each path.
Of course, we don’t
But it can also work (with some probability of failure) if we have a super-additive upper bound on the number of leaves. That is, if \(t\) has children \(t_1,\dots, t_n\), then \(\mathsf{upb}(t) \geqslant \sum_{i=1}^n \mathsf{upb}(t_i)\).
On a \(\top\)-leaf: return it with probability 1
On any other leaf: fail with probability 1
On an internal node \(t\): sample from \(t_i\) with probability \(\frac{\mathsf{upb}(t_i)}{\mathsf{upb}(t)}\), fail otherwise
Let \(T\) be a tree rooted in \(r\), \(\mathsf{upb}\) a super-additive leaf estimator and \(\mathsf{out}\) the output of our algorithm. Then the algorithm is a uniform Las Vegas sampler: for any \(\top\)-leaf \(l\), \[ \mathsf{Pr}(\mathsf{out} = l) = \frac{1}{\mathsf{upb}(T)} \qquad \mathsf{Pr}(\mathsf{out} = \mathsf{fail}) = 1 - \frac{|\top\mathsf{-leaves}(T)|}{\mathsf{upb}(T)} \]
Case \(\mathcal{C}[\leqslant N]\): for each node \(\tau\), the estimator \(\mathsf{upb}(\tau) \coloneq \prod_R|R[\tau]|^{\lambda_R}\), where the weights \((\lambda_R)\) are chosen from the AGM bound, is super-additive
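A minimal sketch of this sampler on the triangle example, under stated assumptions: the weights are \(\lambda_R = \lambda_S = \lambda_T = 1/2\) (the usual fractional edge cover of the triangle), the variable order is \(x, y, z\), the domain is \(\{0,1,2,3\}\), and all function names are hypothetical. `answer_probability` computes the exact probability of reaching a given answer, which telescopes to \(1/\mathsf{upb}(\text{root}) = 1/\sqrt{27}\) for each of the three answers.

```python
import math
import random


def restrict(tuples, rel_vars, assignment):
    """R[tau]: the tuples of a relation that agree with the partial assignment."""
    return [
        t for t in tuples
        if all(assignment.get(v, t[k]) == t[k] for k, v in enumerate(rel_vars))
    ]


def upb(relations, weights, assignment):
    """Product over relations of |R[tau]| ** lambda_R."""
    return math.prod(
        len(restrict(tuples, rel_vars, assignment)) ** w
        for (rel_vars, tuples), w in zip(relations, weights)
    )


def sample_answer(variables, relations, weights, domain, rng):
    """Return one answer, or None when the sampler fails."""
    assignment = {}
    current = upb(relations, weights, assignment)
    for var in variables:
        if current == 0:
            return None
        u = rng.random() * current  # uniform in [0, upb(tau))
        chosen = None
        for d in domain:
            assignment[var] = d
            child = upb(relations, weights, assignment)
            if u < child:  # child d is picked with probability child / current
                chosen = (d, child)
                break
            u -= child
        if chosen is None:
            return None  # leftover probability mass: fail
        assignment[var], current = chosen
    return tuple(assignment[v] for v in variables) if current > 0 else None


def answer_probability(answer, variables, relations, weights):
    """Exact probability that the sampler outputs `answer`."""
    assignment, prob = {}, 1.0
    current = upb(relations, weights, assignment)
    for var, d in zip(variables, answer):
        assignment[var] = d
        child = upb(relations, weights, assignment)
        prob *= child / current
        current = child
    return prob


TRIANGLE = [
    (("x", "y"), {(0, 0), (0, 1), (2, 1)}),
    (("x", "z"), {(0, 0), (0, 2), (2, 3)}),
    (("y", "z"), {(0, 2), (1, 0), (1, 2)}),
]
WEIGHTS = (0.5, 0.5, 0.5)  # assumed AGM weights for the triangle
VARS = ("x", "y", "z")

rng = random.Random(0)
draws = [sample_answer(VARS, TRIANGLE, WEIGHTS, range(4), rng) for _ in range(200)]
print(sum(d is not None for d in draws), "successful draws out of", len(draws))
```

Failed draws are simply retried, which is what makes the procedure a Las Vegas sampler.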
Given a class of databases \(\mathcal{C}[\leqslant N]\), for any query \(Q\) and any database in \(\mathcal{C}[\leqslant N]\), it is possible to uniformly sample from the answer set with expected time \[ \tilde{\mathcal{O}}(\frac{\mathsf{wc}(Q, N)}{\mathsf{max}(1, |\mathsf{ans}(Q)|)} \cdot nm \cdot \mathsf{log}|\mathsf{dom}|) \]
Matches existing complexity results for uniform sampling
We have a simple algorithm for the worst case optimal join
We can sample uniformly from the answer set of the queries
We have presented the work for classes of databases defined by cardinality constraints \(\mathcal{C}[\leqslant N]\), but these algorithms also work for classes of databases defined by acyclic degree constraints: