Florent Capelli[1], Oliver Irwin[2]
11/12/2023 - BOREAL Seminar
[1] - CRIL / Université d’Artois
[2] - CRIStAL / Université de Lille
Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^k R_i(\vec{z_i})\)
where \(\vec{z_i}\) is a tuple over \(X = \{x_1,\dots,x_n\}\)
Example: \(Q(city, country, name, id) = People(id, name, city) \wedge Capitals(city, country)\)
id | name | city |
---|---|---|
1 | Alice | Paris |
2 | Bob | Lens |
3 | Chiara | Rome |
4 | Djibril | Berlin |
5 | Émile | Dortmund |
6 | Francesca | Rome |
city | country |
---|---|
Berlin | Germany |
Paris | France |
Rome | Italy |
city | country | name | id |
---|---|---|---|
Paris | France | Alice | 1 |
Rome | Italy | Chiara | 3 |
Berlin | Germany | Djibril | 4 |
Rome | Italy | Francesca | 6 |
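
A minimal sketch of how the example join can be evaluated, using a hash index on `city`. The names `people`, `capitals` and `join_people_capitals` are illustrative and do not come from the talk.

```python
# Relations of the running example, stored as lists of tuples.
people = [
    (1, "Alice", "Paris"),
    (2, "Bob", "Lens"),
    (3, "Chiara", "Rome"),
    (4, "Djibril", "Berlin"),
    (5, "Émile", "Dortmund"),
    (6, "Francesca", "Rome"),
]
capitals = [("Berlin", "Germany"), ("Paris", "France"), ("Rome", "Italy")]


def join_people_capitals(people, capitals):
    """Evaluate Q(city, country, name, id) = People(id, name, city) AND Capitals(city, country)."""
    # Index Capitals by the shared variable `city`.
    countries_of = {}
    for city, country in capitals:
        countries_of.setdefault(city, []).append(country)
    # Probe the index once per People tuple.
    return [
        (city, country, name, pid)
        for pid, name, city in people
        for country in countries_of.get(city, [])
    ]


answers = join_people_capitals(people, capitals)
```

Bob and Émile disappear from the result because Lens and Dortmund have no matching tuple in `Capitals`.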
We want to access the \(k\)-th element of \(Q(\mathbb{D})\) for a given order.
Naive idea: materialise \(Q(\mathbb{D})\) as an array and sort it. Do we then have direct access?
city | country | name | id |
---|---|---|---|
Berlin | Germany | Djibril | 4 |
Paris | France | Alice | 1 |
Rome | Italy | Chiara | 3 |
Rome | Italy | Francesca | 6 |
\(Q(\mathbb{D})[4] = (Rome, Italy, Francesca, 6)\)
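The naive approach can be sketched as follows: sort the materialised answers once, then index into the sorted array. Indices are 1-based to match the slide. The function names are illustrative.

```python
# Materialised answer set of the running example.
answers = [
    ("Paris", "France", "Alice", 1),
    ("Rome", "Italy", "Chiara", 3),
    ("Berlin", "Germany", "Djibril", 4),
    ("Rome", "Italy", "Francesca", 6),
]


def precompute(answers):
    """Sort all answers lexicographically on (city, country, name, id)."""
    return sorted(answers)


def access(table, k):
    """Return the k-th answer, counting from 1."""
    if not 1 <= k <= len(table):
        raise IndexError(k)
    return table[k - 1]


table = precompute(answers)
fourth = access(table, 4)
```

The access step is a plain array lookup. The whole cost sits in `precompute`, which has to build and sort every answer.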
city | country | name | id |
---|---|---|---|
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
Berlin | Germany | Djibril | 4 |
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
Paris | France | Alice | 1 |
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
Rome | Italy | Chiara | 3 |
Rome | Italy | Francesca | 6 |
\(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
\(Q(\mathbb{D})[1432] =\) ??
Precomputation: very costly
Access : nearly free
We need another way to represent the data
Uniform Sampling (w/o repetition)
gives a good idea of what the dataset is like statistically
Answer Enumeration
by accessing every answer in order
Unifies existing results
In the general case, DA is NP-hard
We need to know if there is a solution:
NP-hard (Chandra, Merlin, 1977)
We need to know how many solutions exist:
#P-hard
\(Q = R_1(x,y) \wedge R_2(y,z) \wedge R_3(z, t) \wedge R_4(t, u) \wedge R_5(u,v)\)
Array (worst-case) size: \(\mathcal{O}(|\mathbb{D}|^k)\)
Path order (\(x y z t u v\)): \(\mathcal{O}(|\mathbb{D}|)\)
More complex queries exhibit similar behaviour: acyclic queries
Central class of queries because of their tractability
\(Q = R_1(x,y,z) \wedge R_2(x,z,u) \wedge R_3(x,y,t) \wedge R_4(y,t) \wedge R_5(y,v)\)
The order used here is \((x, y, z, t, u, v)\).
Load data inside the bags
Annotate by computing the number of extensions
\(Q(\mathbb{D})[3]\) must set \(x\gets 2, y \gets 1, z \gets 0\), then proceed down in the tree.
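The annotate-then-descend idea can be sketched on the simplest kind of join tree, a chain \(R_1(v_0,v_1) \wedge \dots \wedge R_n(v_{n-1},v_n)\) with order \(v_0,\dots,v_n\). This is an illustrative simplification, not the tree of the slide. The names `preprocess`, `access` and `weight` are assumptions. For brevity the descent scans candidates linearly; a binary search over prefix sums would give logarithmic steps.

```python
from collections import defaultdict


def weight(ext, n, i, value):
    """Number of ways to extend v_i = value to v_{i+1}, ..., v_n."""
    return 1 if i == n else ext[i].get(value, 0)


def preprocess(relations):
    """relations[i] is a set of pairs (a, b) for the atom over (v_i, v_{i+1})."""
    n = len(relations)
    succ = []
    for rel in relations:
        by_first = defaultdict(list)
        for a, b in sorted(rel):
            by_first[a].append(b)  # successors are stored in increasing order
        succ.append(dict(by_first))
    # Annotate bottom-up with the number of extensions.
    ext = [dict() for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for a, bs in succ[i].items():
            ext[i][a] = sum(weight(ext, n, i + 1, b) for b in bs)
    return succ, ext


def access(succ, ext, k):
    """Return the k-th answer (1-indexed) in lexicographic order on v_0, ..., v_n."""
    n = len(succ)
    if not 1 <= k <= sum(ext[0].values()):
        raise IndexError(k)
    answer = []
    candidates = sorted(succ[0])
    for i in range(n + 1):
        for value in candidates:
            w = weight(ext, n, i, value)
            if k <= w:
                break      # the k-th answer lies below this value
            k -= w         # skip all answers that go through this value
        answer.append(value)
        if i < n:
            candidates = succ[i][value]
    return tuple(answer)


R1 = {(0, 1), (0, 2), (1, 1), (2, 0)}
R2 = {(0, 0), (1, 0), (1, 3), (2, 2)}
succ, ext = preprocess([R1, R2])
fourth = access(succ, ext, 4)
```

Each step of the descent fixes one variable by comparing `k` with the extension counts, which is the same mechanism as setting \(x\), \(y\), \(z\) and then proceeding down the tree.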
If \(Q\) is acyclic, then there exists a tractable order for direct access.
What happens if the order is given?
An \(\alpha\)-leaf in a query \(Q\) is a variable \(x\) such that the neighbourhood \(N(x)\) of \(x\) is covered by an atom.
1 is an \(\alpha\)-leaf
2 is an \(\alpha\)-leaf
3 is an \(\alpha\)-leaf
4 is an \(\alpha\)-leaf
A query \(Q\) is \(\alpha\)-acyclic iff one can obtain \(\emptyset\) by successively removing \(\alpha\)-leaves in \(Q\). This induces an order on \(V\) called an \(\alpha\)-elimination order.
[Brault-Baron, 2014], also known as “without disruptive trio” [Carmeli, Tziavelis, Gatterbauer, Kimelfeld, Riedewald, 2020]
In the previous example, 1, 2, 3, 4 is an \(\alpha\)-elimination order
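The elimination procedure can be sketched directly from the definition, with each atom represented by its set of variables. The function names are illustrative. The greedy search relies on the assumption that removing an \(\alpha\)-leaf from an \(\alpha\)-acyclic query leaves it \(\alpha\)-acyclic, so the choice of leaf does not matter.

```python
def is_alpha_leaf(x, edges):
    """x is an alpha-leaf if one atom covers every variable that shares an atom with x."""
    neighbourhood = set().union(*(e for e in edges if x in e))
    return any(neighbourhood <= e for e in edges)


def alpha_elimination_order(atoms):
    """Greedily remove alpha-leaves; return the order found, or None if stuck."""
    edges = [set(atom) for atom in atoms]
    remaining = set().union(*edges)
    order = []
    while remaining:
        leaf = next((x for x in sorted(remaining) if is_alpha_leaf(x, edges)), None)
        if leaf is None:
            return None  # no alpha-leaf left: not alpha-acyclic
        order.append(leaf)
        remaining.discard(leaf)
        for e in edges:
            e.discard(leaf)
    return order


def is_alpha_elimination_order(atoms, order):
    """Check that removing the variables in the given order only ever removes alpha-leaves."""
    edges = [set(atom) for atom in atoms]
    if sorted(order) != sorted(set().union(*edges)):
        return False
    for x in order:
        if not is_alpha_leaf(x, edges):
            return False
        for e in edges:
            e.discard(x)
    return True


# Q = R1(x,y,z) AND R2(x,z,u) AND R3(x,y,t) AND R4(y,t) AND R5(y,v)
query = [("x", "y", "z"), ("x", "z", "u"), ("x", "y", "t"), ("y", "t"), ("y", "v")]
triangle = [("x", "y"), ("y", "z"), ("x", "z")]
```

On `query`, the reversed order \((v, u, t, z, y, x)\) passes `is_alpha_elimination_order`, while `alpha_elimination_order(triangle)` returns `None` because no variable of the triangle is an \(\alpha\)-leaf.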
Given a join query \(Q(x_1,\dots,x_n)\), if \(x_n, \dots, x_1\) is an \(\alpha\)-elimination order then we can answer direct access queries with precomputation time \(\mathcal{O}(|\mathbb{D}|\mathsf{poly}(|Q|))\) and access time \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\).
[Carmeli, Tziavelis, Gatterbauer, Kimelfeld, Riedewald, 2020]
Algorithm schema
We want to access the \(k\)-th solution to a query for a given database
Make a table, sort it, and done?
city | country | name | id |
---|---|---|---|
Berlin | Germany | Djibril | 4 |
Paris | France | Alice | 1 |
Rome | Italy | Chiara | 3 |
Rome | Italy | Francesca | 6 |
We need another solution 😢
Join Tree Approach
Use a join tree to answer tasks efficiently
Works for \(\alpha\)-acyclic queries 🥳
Negative Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^k \lnot R_i(\vec{z_i})\)
Big difference:
positively encoding \(\lnot R(\vec{z})\) on a domain \(D\) requires \(|D|^{|\vec{z}|} - \#R\) tuples
\(x_1\) | \(x_2\) | \(x_3\) |
---|---|---|
0 | 1 | 0 |
\(x_1\) | \(x_2\) | \(x_3\) |
---|---|---|
0 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 1 |
1 | 0 | 0 |
1 | 0 | 1 |
1 | 1 | 0 |
1 | 1 | 1 |
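
The blow-up can be checked on the example above: over the domain \(\{0,1\}\), the single tuple of \(R\) turns into \(2^3 - 1 = 7\) tuples once the negation is encoded positively. The helper name `complement` is illustrative.

```python
from itertools import product


def complement(relation, domain, arity):
    """Positive encoding of NOT R: every tuple over the domain that is not in R."""
    return [t for t in product(sorted(domain), repeat=arity) if t not in relation]


R = {(0, 1, 0)}
not_R = complement(R, {0, 1}, 3)
```

The list `not_R` contains exactly the seven rows of the second table, and its size grows as \(|D|^{|\vec{z}|}\) while \(R\) itself stays small.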
Let \(Q'\) be any query and consider \(Q = Q'(x_1,\dots,x_n) \wedge \lnot R(x_1,\dots,x_n)\).
For any database \(\mathbb{D}\) such that \(R^\mathbb{D}= \emptyset\), we have \(Q(\mathbb{D}) = Q'(\mathbb{D})\)
\(\implies\) if \(Q\) is tractable, \(Q'\) is tractable
But \(Q\) is \(\alpha\)-acyclic and \(Q'\) is not restricted
[Figure: one hypergraph that is acyclic and one that is not acyclic]
\(\alpha\)-acyclicity does not suffice as it is not monotone: a subquery of an \(\alpha\)-acyclic query need not be \(\alpha\)-acyclic
Good candidate for another measure of tractability: every \(Q' \subseteq Q\) is \(\alpha\)-acyclic.
This is known as \(\beta\)-acyclicity
This notion is not easy to work with directly. How can we exploit it?
A \(\beta\)-leaf is a variable \(x\) such that all the atoms that include \(x\) are contained in one another.
Characterisation: A query \(Q\) is \(\beta\)-acyclic iff one can obtain \(\emptyset\) by successively removing \(\beta\)-leaves in \(Q\). This induces an order on \(V\) called a \(\beta\)-elimination order.
Intuition: a \(\beta\)-elimination order is an order that is an \(\alpha\)-elimination order for every subquery
[Figure: one hypergraph that is not \(\beta\)-acyclic and one that is \(\beta\)-acyclic]
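The characterisation by \(\beta\)-leaves can be sketched in the same style, with atoms as variable sets. The function names are illustrative. The greedy search assumes that removing a \(\beta\)-leaf preserves \(\beta\)-acyclicity, which holds because the property is closed under subqueries.

```python
def is_beta_leaf(x, edges):
    """x is a beta-leaf if the atoms containing x form a chain under inclusion."""
    containing = sorted((e for e in edges if x in e), key=len)
    return all(small <= large for small, large in zip(containing, containing[1:]))


def beta_elimination_order(atoms):
    """Greedily remove beta-leaves; return the order found, or None if stuck."""
    edges = [set(atom) for atom in atoms]
    remaining = set().union(*edges)
    order = []
    while remaining:
        leaf = next((x for x in sorted(remaining) if is_beta_leaf(x, edges)), None)
        if leaf is None:
            return None  # no beta-leaf left: not beta-acyclic
        order.append(leaf)
        remaining.discard(leaf)
        for e in edges:
            e.discard(leaf)
    return order


# A triangle covered by one large atom: every variable's neighbourhood is
# covered by the large atom, yet the atoms {x,y} and {x,z} are incomparable.
covered_triangle = [("x", "y"), ("y", "z"), ("x", "z"), ("x", "y", "z")]
path = [("x", "y"), ("y", "z"), ("z", "t"), ("t", "u"), ("u", "v")]
```

`beta_elimination_order(covered_triangle)` returns `None`: dropping the large atom leaves a plain triangle, which is a cyclic subquery. The path query, on the other hand, admits a \(\beta\)-elimination order.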
Direct Access for \(\beta\)-acyclic NJQ with \(\mathcal{O}(\mathsf{poly}(|\mathbb{D}|))\) preprocessing and access time \(\mathcal{O}(\mathsf{polylog}(|\mathbb{D}|)\mathsf{poly}(|Q|))\) for lexicographical orders based on (reversed) \(\beta\)-elimination orders.
Side Note:
Join-Tree based approaches fail for \(\beta\)-acyclic NJQs
\(x_1\) | \(x_2\) | \(x_3\) |
---|---|---|
0 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 0 | 2 |
1 | 1 | 1 |
1 | 1 | 2 |
1 | 2 | 0 |
1 | 2 | 1 |
2 | 0 | 1 |
2 | 0 | 2 |
2 | 2 | 1 |
2 | 2 | 2 |
factorised representation of relations
circuit with 3 kinds of gates :
paths from decision gates are labelled by the domain values
+ order \(\prec\) on the variables
For \(C\) an ordered relational circuit, we can perform direct access tasks in time \(\mathcal{O}(\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\) after a preprocessing in time \(\mathcal{O}(|C|\cdot\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\)
Idea: for each gate \(v\) over \(x_i\) and each domain value \(d\), compute the size of the relation where \(x_i\) is set to a value \(d'\leqslant d\)
Compute the 7th solution \(\to\) 111
Compute the 13th solution \(\to\) 221
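The prefix-size idea can be sketched on a simplified structure: a decision tree (trie) over the 14 tuples of the example relation, with one decision node per prefix. This is only a special case of an ordered relational circuit, since it has no Cartesian-product gates and no sharing. The class and function names are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate


class Decision:
    """Decision node on one variable; a child of None stands for an accepting leaf."""

    def __init__(self, children):
        self.values = sorted(children)
        self.children = [children[v] for v in self.values]
        sizes = [1 if c is None else c.size for c in self.children]
        # prefix[j] = number of tuples whose value here is <= values[j]
        self.prefix = list(accumulate(sizes))
        self.size = self.prefix[-1]


def build(tuples):
    """Build the decision tree of a non-empty set of equal-length, non-empty tuples."""
    groups = {}
    for t in tuples:
        groups.setdefault(t[0], []).append(t[1:])
    return Decision({v: (build(rest) if rest[0] else None) for v, rest in groups.items()})


def access(gate, k):
    """Return the k-th tuple (1-indexed) in lexicographic order."""
    if not 1 <= k <= gate.size:
        raise IndexError(k)
    answer = []
    while gate is not None:
        j = bisect_left(gate.prefix, k)  # first value whose prefix size reaches k
        if j > 0:
            k -= gate.prefix[j - 1]      # skip the tuples under smaller values
        answer.append(gate.values[j])
        gate = gate.children[j]
    return tuple(answer)


rows = [
    (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
    (1, 0, 1), (1, 0, 2), (1, 1, 1), (1, 1, 2), (1, 2, 0), (1, 2, 1),
    (2, 0, 1), (2, 0, 2), (2, 2, 1), (2, 2, 2),
]
root = build(rows)
seventh, thirteenth = access(root, 7), access(root, 13)
```

Each level costs one binary search over the prefix sizes, which is where the logarithmic factor in the domain size comes from.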
\(Q\) a CQ and \(x_1\prec\dots\prec x_n\) an order over the variable set
\(Q(\mathbb{D}) = \biguplus_{d\in D} Q[x_1 = d](\mathbb{D})\)
\[ \text{if} \begin{cases} Q & = & Q_1 \land Q_2 \\ \mathsf{var}(Q_1) \cap \mathsf{var}(Q_2) & = & \emptyset \end{cases} \]
then \(Q(\mathbb{D}) = Q_1(\mathbb{D}) \times Q_2(\mathbb{D})\)
recursive implementation + cache \(\implies\) ordered relational circuit computing \(Q(\mathbb{D})\)
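A rough sketch of the branching rule with a cache, for negative join queries. It makes two simplifying assumptions: it counts the answers instead of building the circuit, and it omits the Cartesian-product rule for variable-disjoint subqueries. An atom \(\lnot R(\vec{z})\) is given as a pair (variables, forbidden tuples), and each variable is assumed to occur at most once per atom. All names are illustrative.

```python
from functools import lru_cache


def condition(state, x, d):
    """Residual query after setting x = d, or None if a negated atom is violated."""
    residual = set()
    for variables, forbidden in state:
        if x not in variables:
            residual.add((variables, forbidden))
            continue
        p = variables.index(x)
        rest_vars = variables[:p] + variables[p + 1:]
        rest = frozenset(t[:p] + t[p + 1:] for t in forbidden if t[p] == d)
        if not rest:
            continue       # no forbidden tuple is compatible any more: atom satisfied
        if not rest_vars:
            return None    # a forbidden tuple is fully matched: dead branch
        residual.add((rest_vars, rest))
    return frozenset(residual)


def count_answers(atoms, order, domain):
    """Count the answers of a conjunction of negated atoms over the given domain."""
    domain = tuple(sorted(domain))

    @lru_cache(maxsize=None)
    def solve(i, state):
        if i == len(order):
            return 1
        total = 0
        for d in domain:  # Q(D) is the disjoint union of the Q[x_i = d](D)
            residual = condition(state, order[i], d)
            if residual is not None:
                total += solve(i + 1, residual)
        return total

    start = frozenset((tuple(vs), frozenset(ts)) for vs, ts in atoms)
    return solve(0, start)


single = [(("x1", "x2", "x3"), {(0, 1, 0)})]
two = [(("x1", "x2"), {(0, 0), (1, 1)}), (("x2", "x3"), {(0, 1)})]
n_single = count_answers(single, ["x1", "x2", "x3"], {0, 1})
n_two = count_answers(two, ["x1", "x2", "x3"], {0, 1})
```

The cache key is the residual query itself, so two branches that lead to the same residual query are solved once. In the circuit construction, this is what lets gates be shared.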
Let \(Q\) be an NJQ and \(x_n,\dots,x_1\) a \(\beta\)-elimination order for \(Q\). Exhaustive DPLL on \(Q\), \(\mathbb{D}\) and with order \(x_1,\dots,x_n\) returns an ordered circuit of size \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{poly}(|\mathbb{D}|))\).
(Generalisation of [Capelli, 2017])
For a query \(Q(x_1,\dots,x_n)\) and an order on the variables of “complexity” \(k\), we can solve DA tasks with a preprocessing in time \(\mathcal{O}(|\mathbb{D}|^k\mathsf{poly}(|Q|))\) and access in time \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\).
Algorithm schema:
We want to access the \(k\)-th solution to a query for a given database
Make a table, sort it, and done?
city | country | name | id |
---|---|---|---|
Berlin | Germany | Djibril | 4 |
Paris | France | Alice | 1 |
Rome | Italy | Chiara | 3 |
Rome | Italy | Francesca | 6 |
We need another solution 😢
Join Tree Approach
Works for positive \(\alpha\)-acyclic queries 🥳
No notion of join tree for \(\beta\)-acyclic negative queries
😢
We propose a new approach!
Recovers former results 🥳
Handles negative queries 😍
This technique generalises to:
Going further with circuits
study the tractability of the circuit approach for DA on CQs with aggregation
\(Q(p, c, g, \mathsf{count()}) = \mathsf{Teams}(p, c) \land \mathsf{Games}(g, c, \cdot) \land \mathsf{Tries}(g, p)\)
How should we integrate the aggregation in the lexicographical order?
How does the aggregation fit into the compiled circuits?
\(\to\) (Eldar, Carmeli, Kimelfeld, 2023)
generalise the circuit approach to queries over annotated databases (FAQ and AJAR queries)
\(\to\) (Zhao, Fan, Ouyang, Koutris, 2023)