Florent Capelli[1], Oliver Irwin[2]
25/03/2024 - ICDT 2024
[1] - CRIL / Université d’Artois
[2] - CRIStAL / Université de Lille
Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^k R_i(\vec{z_i})\)
where \(\vec{z_i}\) is a tuple over \(X = \{x_1,\dots,x_n\}\)
Example: \(Q(city, country, name, id) = People(id, name, city) \wedge Capitals(city, country)\)
id | name | city |
---|---|---|
1 | Alice | Paris |
2 | Bob | Lens |
3 | Chiara | Rome |
4 | Djibril | Berlin |
5 | Émile | Dortmund |
6 | Francesca | Rome |
city | country |
---|---|
Berlin | Germany |
Paris | France |
Rome | Italy |
city | country | name | id |
---|---|---|---|
Paris | France | Alice | 1 |
Rome | Italy | Chiara | 3 |
Berlin | Germany | Djibril | 4 |
Rome | Italy | Francesca | 6 |
We want to access the \(k\)-th element of \(Q(\mathbb{D})\) in the lexicographical order induced by a given order over the variables.
Make \(Q(\mathbb{D})\) an array, sort it and then we have direct access?
city | country | name | id |
---|---|---|---|
Berlin | Germany | Djibril | 4 |
Paris | France | Alice | 1 |
Rome | Italy | Chiara | 3 |
Rome | Italy | Francesca | 6 |
\(Q(\mathbb{D})[4] = (Rome, Italy, Francesca, 6)\)
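The naive approach on this slide can be sketched as follows: materialise the join, sort it, and index into the array. This is only an illustration on the example tables; the names `people`, `capitals`, and `join_people_capitals` are hypothetical, and positions are counted from 1 as on the slide.

```python
# Toy database from the example (People and Capitals tables).
people = [
    (1, "Alice", "Paris"), (2, "Bob", "Lens"), (3, "Chiara", "Rome"),
    (4, "Djibril", "Berlin"), (5, "Émile", "Dortmund"), (6, "Francesca", "Rome"),
]
capitals = [("Berlin", "Germany"), ("Paris", "France"), ("Rome", "Italy")]


def join_people_capitals(people, capitals):
    """Materialise Q(city, country, name, id) = People(id, name, city) AND Capitals(city, country)."""
    country_of = dict(capitals)  # assumes each city appears at most once in Capitals
    return [
        (city, country_of[city], name, pid)
        for pid, name, city in people
        if city in country_of
    ]


# Costly precomputation: build the whole answer set and sort it
# lexicographically for the variable order (city, country, name, id).
answers = sorted(join_people_capitals(people, capitals))

# Nearly free access: the k-th answer (counting from 1) is one array lookup.
fourth = answers[4 - 1]
```

The sort relies on Python's tuple comparison, which is lexicographic in the column order chosen when building the tuples.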
city | country | name | id |
---|---|---|---|
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
Berlin | Germany | Djibril | 4 |
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
Paris | France | Alice | 1 |
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
Rome | Italy | Chiara | 3 |
Rome | Italy | Francesca | 6 |
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
\(Q(\mathbb{D})[1432] =\) ??
Precomputation: very costly
Access: nearly free
We need another way to represent the data
Uniform Sampling (w/o repetition)
gives a good idea of what the dataset is like statistically
Answer Enumeration
by accessing every answer in order
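Both applications reduce to calls to a direct access routine. The sketch below assumes a hypothetical function `access(k)` returning the answer at position `k` (counted from 0 here) and a known number of answers `n_answers`; a sorted Python list stands in for the real data structure.

```python
import random


def enumerate_in_order(access, n_answers):
    """Answer enumeration: access positions 0, 1, ..., n_answers - 1 in turn."""
    for k in range(n_answers):
        yield access(k)


def sample_without_repetition(access, n_answers, how_many, rng=None):
    """Uniform sampling without repetition: draw distinct positions, then access them."""
    rng = rng or random.Random()
    for k in rng.sample(range(n_answers), how_many):
        yield access(k)


# Stand-in for a direct access structure: the sorted answers of the running example.
answers = [
    ("Berlin", "Germany", "Djibril", 4),
    ("Paris", "France", "Alice", 1),
    ("Rome", "Italy", "Chiara", 3),
    ("Rome", "Italy", "Francesca", 6),
]
access = answers.__getitem__

in_order = list(enumerate_in_order(access, len(answers)))
three = list(sample_without_repetition(access, len(answers), 3, random.Random(0)))
```

Sampling distinct positions uniformly gives distinct answers uniformly, because direct access is a bijection between positions and answers.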
In the general case, DA is NP-hard
DA lets us decide whether a solution exists:
NP-hard (Chandra, Merlin, 1977)
DA lets us find out how many solutions exist:
#P-hard
Are there cases where the problem is tractable?
\(Q = R_1(x,y) \wedge R_2(y,z) \wedge R_3(z, t) \wedge R_4(t, u) \wedge R_5(u,v)\)
Worst-case: \(\mathcal{O}(|\mathbb{D}|^5)\) preprocessing / \(\mathcal{O}(1)\) access
Path order (\(x y z t u v\)): dynamic programming \(\mathcal{O}(|\mathbb{D}|)\) preprocessing / \(\mathcal{O}(\mathsf{log}|\mathbb{D}|)\) access
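A minimal sketch of the dynamic programming idea for a path query under the path order, assuming each relation is a Python set of pairs. The names `preprocess`, `access`, and `weighted_index` are hypothetical, positions are counted from 0, and the sorting step adds a logarithmic factor to the preprocessing compared with the bound on the slide.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


def weighted_index(values, weight):
    """Sorted values of positive weight, with running totals of their weights."""
    kept = sorted(v for v in set(values) if weight.get(v, 0) > 0)
    return kept, list(accumulate(weight[v] for v in kept))


def preprocess(relations):
    """relations[i] is a set of pairs (a, b) linking variable i to variable i + 1."""
    m = len(relations)
    # count[i][v]: number of ways to extend value v of variable i to the end of the path.
    count = [defaultdict(int) for _ in range(m + 1)]
    for _, b in relations[-1]:
        count[m][b] = 1
    for i in range(m - 1, -1, -1):
        for a, b in relations[i]:
            count[i][a] += count[i + 1].get(b, 0)
    first = weighted_index(count[0], count[0])
    successors = []
    for i in range(m):
        grouped = defaultdict(list)
        for a, b in relations[i]:
            grouped[a].append(b)
        successors.append(
            {a: weighted_index(bs, count[i + 1]) for a, bs in grouped.items()}
        )
    return first, successors


def access(index, k):
    """Return the k-th answer (counted from 0) in lexicographic order."""
    (values, totals), successors = index
    if k < 0 or not totals or k >= totals[-1]:
        raise IndexError("answer index out of range")
    answer = []
    for level in range(len(successors) + 1):
        j = bisect_right(totals, k)  # first value whose running total exceeds k
        if j > 0:
            k -= totals[j - 1]       # position inside the block of values[j]
        answer.append(values[j])
        if level < len(successors):
            values, totals = successors[level][values[j]]
    return tuple(answer)


# Small two-atom path R1(x, y) AND R2(y, z), chosen only for illustration.
R1 = {(1, 2), (1, 3), (2, 3)}
R2 = {(2, 5), (3, 4), (3, 6)}
index = preprocess([R1, R2])
all_answers = [access(index, k) for k in range(5)]
```

Each access performs one binary search per variable, which is where the logarithmic access time comes from.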
We can use similar techniques for DA on more complex queries: acyclic queries
What happens if the order is given?
If the order is set, then we can measure how hard it is for DA
In the previous example:
Order complexity
For a query \(Q\) and an order \(\pi\), related work defines a function \(\iota(Q, \pi)\) (the incompatibility number; Bringmann, Carmeli, Mengel, 2022) that measures how hard this order is for DA over \(Q\).
We have:
We can define a tractability measure \(\iota\) such that:
For a query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables of complexity \(\iota(Q, \pi)\), we can solve DA tasks with:
Carmeli, Tziavelis, Gatterbauer, Kimelfeld, Riedewald, 2020
Bringmann, Carmeli, Mengel, 2022
Signed Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^{k} P_i(\vec{z_i}) \land \bigwedge_{j=1}^{\ell} \lnot N_j(\vec{z_j}')\)
Big difference:
positively encoding \(\lnot N(\vec{z})\) on a domain \(D\) requires \(|D|^{|\vec{z}|} - \#N\) tuples
\(x_1\) | \(x_2\) | \(x_3\) |
---|---|---|
0 | 1 | 0 |
\(x_1\) | \(x_2\) | \(x_3\) |
---|---|---|
0 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 1 |
1 | 0 | 0 |
1 | 0 | 1 |
1 | 1 | 0 |
1 | 1 | 1 |
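The blow-up can be checked on the two tables above. This is a small sketch assuming the domain \(D = \{0, 1\}\) and \(N = \{(0, 1, 0)\}\); the variable names are illustrative.

```python
from itertools import product

domain = (0, 1)
N = {(0, 1, 0)}  # the negated relation: a single forbidden tuple
arity = 3

# Positive encoding of "not N": every tuple over the domain that is absent from N.
complement = sorted(t for t in product(domain, repeat=arity) if t not in N)

# Size predicted by the formula |D|^{|z|} - #N.
expected_size = len(domain) ** arity - len(N)
```

One forbidden tuple already forces seven positive tuples here, and the gap grows exponentially with the arity of the atom.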
\(Q_1 = R(1, 2, 3) \land S(1, 2) \land T(2, 3) \land U(3, 1)\)
has linear preprocessing
\(Q_2 = S(1, 2) \land T(2, 3) \land U(3, 1)\)
non-linear preprocessing (triangle)
query should be as hard as its subquery
\(Q_1 =\) \(\lnot R(1, 2, 3)\) \(\land S(1, 2) \land T(2, 3) \land U(3, 1)\)
non-linear preprocessing (DB w/ empty relation)
\(Q_2 = S(1, 2) \land T(2, 3) \land U(3, 1)\)
non-linear preprocessing (triangle)
with negation, the query is at least as hard as its subquery: when \(R\) is empty, \(\lnot R\) always holds and \(Q_1\) reduces to \(Q_2\)
Preprocessing for an SJQ \(Q\) should be at least that of any \(Q' \subseteq Q\)
Good candidate for a stricter measure of query tractability: the signed hyperorder width (\(\mathsf{show}\))
\(Q = P \land N,\; \mathsf{show}(Q, \pi) = \max_{N'\subseteq N} f(P\land N', \pi)\)
\(Q_1 =\) \(\lnot R(1, 2, 3)\) \(\land S(1, 2) \land T(2, 3) \land U(3, 1)\)
With negative big atom: \(2\)
\(Q_2 =\) \(\lnot R(1, 2, 3)\) \(\land T(2, 3) \land U(3, 1)\)
All subqueries have complexity \(1\)
Tractability of DA over SJQs can be expressed with \(\mathsf{show}\):
For a signed query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables, we can solve DA tasks with:
even with signed queries, we are able to build a good algorithm
\(x_1\) | \(x_2\) | \(x_3\) |
---|---|---|
0 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 0 | 2 |
1 | 1 | 1 |
1 | 1 | 2 |
1 | 2 | 0 |
1 | 2 | 1 |
2 | 0 | 1 |
2 | 0 | 2 |
2 | 2 | 1 |
2 | 2 | 2 |
factorised representation of relations
circuit with 3 kinds of gates :
paths from decision gates are labelled by the domain values
+ order \(\prec\) on the variables
For \(C\) an ordered relational circuit, we can perform direct access tasks in time \(\mathcal{O}(\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\) after a preprocessing in time \(\mathcal{O}(|C|\cdot\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\)
Idea: for each gate \(v\) over \(x_i\) and for each domain value \(d\)
compute the size of the relation where \(x_i\) is set to a value \(d'\leqslant d\)
Compute the 13th solution \(\to\) 221
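The counting idea can be sketched on the example relation with a plain trie, that is, a degenerate circuit made of decision gates only, with no sharing and no product gates. The class `Decision` and the functions `build` and `access` are hypothetical names, and positions are counted from 1 as on the slide.

```python
from bisect import bisect_left
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Decision:
    values: list      # sorted domain values that lead to at least one solution
    cumulative: list  # cumulative[j]: number of solutions with a value <= values[j]
    children: list    # children[j] is None once every variable is set


def build(suffixes):
    """Build decision gates for a non-empty list of equal-length tuples."""
    if len(suffixes[0]) == 0:
        return None  # every variable is set: exactly one solution
    groups = {}
    for t in suffixes:
        groups.setdefault(t[0], []).append(t[1:])
    values = sorted(groups)
    children = [build(groups[d]) for d in values]
    sizes = [1 if c is None else c.cumulative[-1] for c in children]
    return Decision(values, list(accumulate(sizes)), children)


def access(root, k):
    """Return the k-th solution in lexicographic order, counting from 1."""
    if root is None or not 1 <= k <= root.cumulative[-1]:
        raise IndexError("solution index out of range")
    answer, node = [], root
    while node is not None:
        j = bisect_left(node.cumulative, k)  # first value whose cumulative count reaches k
        if j > 0:
            k -= node.cumulative[j - 1]      # rank inside the chosen branch
        answer.append(node.values[j])
        node = node.children[j]
    return tuple(answer)


# The relation over (x1, x2, x3) from the table above.
relation = [
    (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 1), (1, 0, 2), (1, 1, 1),
    (1, 1, 2), (1, 2, 0), (1, 2, 1), (2, 0, 1), (2, 0, 2), (2, 2, 1), (2, 2, 2),
]
root = build(relation)
thirteenth = access(root, 13)
```

At the root the cumulative counts are 4, 10 and 14, so position 13 falls under \(x_1 = 2\) with residual rank 3, and the same step repeats on \(x_2\) and \(x_3\).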
\(Q\) a SJQ and \(x_1\prec\dots\prec x_n\) an order over the variable set
\(Q(\mathbb{D}) = \biguplus_{d\in D} Q[x_1 = d](\mathbb{D})\)
\[ \text{if} \begin{cases} Q & = & Q_1 \land Q_2 \\ \mathsf{var}(Q_1) \cap \mathsf{var}(Q_2) & = & \emptyset \end{cases} \]
then \(Q(\mathbb{D}) = Q_1(\mathbb{D}) \times Q_2(\mathbb{D})\)
recursive implementation + cache \(\implies\) ordered relational circuit computing \(Q(\mathbb{D})\)
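A sketch of the two rules combined with a cache, under simplifying assumptions: it only counts the answers instead of emitting gates (each cached call would correspond to one gate of the circuit), atoms are triples `(positive, variables, relation)` with no repeated variable inside an atom, and all names (`condition`, `split`, `count_answers`) are hypothetical.

```python
from functools import lru_cache


def condition(atom, x, d):
    """Restrict an atom to x = d: True if satisfied, False if violated, else a smaller atom."""
    positive, vs, rel = atom
    if x not in vs:
        return atom
    i = vs.index(x)
    rest = vs[:i] + vs[i + 1:]
    new_rel = frozenset(t[:i] + t[i + 1:] for t in rel if t[i] == d)
    if not new_rel:   # no tuple is compatible with x = d
        return not positive
    if not rest:      # fully assigned, and the tuple is in the relation
        return positive
    return (positive, rest, new_rel)


def split(atoms, variables):
    """Partition variables and atoms into variable-disjoint subqueries."""
    remaining, parts = list(variables), []
    while remaining:
        comp, changed = {remaining[0]}, True
        while changed:
            changed = False
            for _, vs, _ in atoms:
                if comp & set(vs) and not set(vs) <= comp:
                    comp |= set(vs)
                    changed = True
        parts.append((frozenset(a for a in atoms if set(a[1]) <= comp),
                      tuple(v for v in variables if v in comp)))
        remaining = [v for v in remaining if v not in comp]
    return parts


def count_answers(atoms, order, domain):
    """Count the answers of a signed join query by exhaustive branching with a cache."""
    domain = tuple(domain)

    @lru_cache(maxsize=None)
    def go(atoms, variables):
        if not variables:
            return 1
        parts = split(atoms, variables)
        if len(parts) > 1:             # product rule on variable-disjoint subqueries
            total = 1
            for sub_atoms, sub_vars in parts:
                total *= go(sub_atoms, sub_vars)
            return total
        x, rest = variables[0], variables[1:]
        total = 0
        for d in domain:               # disjoint union over the values of x
            kept = set()
            for atom in atoms:
                c = condition(atom, x, d)
                if c is False:
                    break
                if c is not True:
                    kept.add(c)
            else:
                total += go(frozenset(kept), rest)
        return total

    return go(frozenset(atoms), tuple(order))


not_n = (False, ("x1", "x2", "x3"), frozenset({(0, 1, 0)}))
n_answers = count_answers([not_n], ("x1", "x2", "x3"), (0, 1))
```

The negated atom is never expanded into its complement: it is only conditioned, and it disappears as soon as no forbidden tuple remains compatible with the current assignment.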
Let \(Q\) be an SJQ and \(\pi = x_n,\dots,x_1\) an order of complexity \(\mathsf{show}(Q, \pi)\) for \(Q\).
Exhaustive DPLL on \(Q\), \(\mathbb{D}\) and with order \(\pi\) returns an ordered circuit of size \(\mathcal{O}(\mathsf{poly}(|Q|)|\mathbb{D}|^{\mathsf{show}(Q, \pi)+1})\)
(Generalisation of [Capelli, 2017])
For a query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables of complexity \(\mathsf{show}(Q, \pi)\), we can solve DA tasks with a preprocessing in time \(\mathcal{O}(|\mathbb{D}|^{\mathsf{show}(Q, \pi) + 1}\mathsf{poly}(|Q|))\) and access in time \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\).
Algorithm schema:
This technique generalises to signed conjunctive queries (with \(\exists\) quantifiers)
Going further with circuits
study the tractability of the circuit approach for DA on CQs with aggregation
\(Q(p, c, g, \mathsf{count()}) = \mathsf{Teams}(p, c) \land \mathsf{Games}(g, c, \cdot) \land \mathsf{Tries}(g, p)\)
How should we integrate the aggregation in the lexicographical order?
How does the aggregation fit into the compiled circuits?
\(\to\) (Eldar, Carmeli, Kimelfeld, 2023)
generalise the circuit approach to queries over annotated databases (FAQ and AJAR queries)
\(\to\) (Zhao, Fan, Ouyang, Koutris, 2023)
work on lower bounds for DA on SJQs
Mostly solved, will appear in a longer version
Shows that the tractability measure is the optimal choice
\(\to\) (Ongoing work with Nofar Carmeli)