Direct Access for Conjunctive Queries with Negation

Florent Capelli^[1], Oliver Irwin^[2]

25/03/2024 - ICDT 2024

[1] - CRIL / Université d’Artois

[2] - CRIStAL / Université de Lille

Direct Access

Context

Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^k R_i(\vec{z_i})\)

where \(\vec{z_i}\) is a tuple over \(X = \{x_1,\dots,x_n\}\)

Example: \(Q(city, country, name, id) = People(id, name, city) \wedge Capitals(city, country)\)

People
id	name	city
1	Alice	Paris
2	Bob	Lens
3	Chiara	Rome
4	Djibril	Berlin
5	Émile	Dortmund
6	Francesca	Rome

Capitals
city	country
Berlin	Germany
Paris	France
Rome	Italy

\(Q(\mathbb{D})\)
city	country	name	id
Paris	France	Alice	1
Rome	Italy	Chiara	3
Berlin	Germany	Djibril	4
Rome	Italy	Francesca	6

Direct Access

We want to access the \(k\)-th element of \(Q(\mathbb{D})\) in the lexicographical order induced by a given order over the variables.

Materialise \(Q(\mathbb{D})\) as an array, sort it, and then we have direct access?

\(Q(\mathbb{D})\)
city	country	name	id
Berlin	Germany	Djibril	4
Paris	France	Alice	1
Rome	Italy	Chiara	3
Rome	Italy	Francesca	6

\(Q(\mathbb{D})[4] = (Rome, Italy, Francesca, 6)\)
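The naive approach on the running example can be sketched as follows. This is only an illustration of "materialise, sort, index"; the function name `naive_direct_access` and the in-memory representation of the tables are assumptions made for this sketch.

```python
# Tables from the running example, as lists of tuples.
people = [
    (1, "Alice", "Paris"), (2, "Bob", "Lens"), (3, "Chiara", "Rome"),
    (4, "Djibril", "Berlin"), (5, "Émile", "Dortmund"), (6, "Francesca", "Rome"),
]
capitals = [("Berlin", "Germany"), ("Paris", "France"), ("Rome", "Italy")]

def naive_direct_access(people, capitals):
    """Materialise Q(D) and sort it lexicographically on (city, country, name, id)."""
    country_of = dict(capitals)
    return sorted(
        (city, country_of[city], name, pid)
        for pid, name, city in people
        if city in country_of
    )

answers = naive_direct_access(people, capitals)
# The slides count from 1, Python lists from 0: the 4th answer is answers[3].
fourth = answers[3]
print(fourth)
```

Once the array is built, each access is a single index lookup; the cost is entirely in building and sorting the array.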

\(Q(\mathbb{D})\)
city	country	name	id
\(\dots\)	\(\dots\)	\(\dots\)	\(\dots\)
Berlin	Germany	Djibril	4
\(\dots\)	\(\dots\)	\(\dots\)	\(\dots\)
Paris	France	Alice	1
\(\dots\)	\(\dots\)	\(\dots\)	\(\dots\)
Rome	Italy	Chiara	3
Rome	Italy	Francesca	6
\(\dots\)	\(\dots\)	\(\dots\)	\(\dots\)

\(Q(\mathbb{D})[1432] =\) ??

Precomputation: very costly (all of \(Q(\mathbb{D})\) is materialised and sorted)

Access: nearly free

We need another way to represent the data

Applications

Uniform Sampling (w/o repetition)

gives a good idea of what the dataset is like statistically

Answer Enumeration

by accessing every answer in order
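Both applications only need an access function and the number of answers. A minimal sketch, assuming a hypothetical `access(k)` callable (0-indexed) and a known answer count; here a plain sorted list stands in for a real direct access structure.

```python
import random

def sample_without_repetition(access, count, m, rng=random):
    """Draw m distinct answers uniformly at random by sampling distinct indices."""
    return [access(k) for k in rng.sample(range(count), m)]

def enumerate_in_order(access, count):
    """Enumerate all answers in order by accessing indices 0, 1, ..., count - 1."""
    for k in range(count):
        yield access(k)

# Stand-in for a direct access structure: a small sorted list of answers.
answers = [("Berlin", 4), ("Paris", 1), ("Rome", 3), ("Rome", 6)]
access = answers.__getitem__

sample = sample_without_repetition(access, len(answers), 2)
everything = list(enumerate_in_order(access, len(answers)))
```

Sampling indices rather than answers is what avoids repetition without ever materialising the full answer set.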

Tractable Join Queries

Complexity of Direct Access

In the general case, DA is NP-hard

We need to know if there is a solution:

NP-hard (Chandra, Merlin, 1977)

We need to know how many solutions exist:

#P-hard

Are there cases where the problem is tractable?

A tractable example

\(Q = R_1(x,y) \wedge R_2(y,z) \wedge R_3(z, t) \wedge R_4(t, u) \wedge R_5(u,v)\)

Naive approach (materialise \(Q(\mathbb{D})\)), worst case: \(\mathcal{O}(|\mathbb{D}|^5)\) preprocessing / \(\mathcal{O}(1)\) access

Path order (\(x y z t u v\)): dynamic programming, \(\mathcal{O}(|\mathbb{D}|)\) preprocessing / \(\mathcal{O}(\log|\mathbb{D}|)\) access
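The dynamic programming idea for the path order can be sketched as below. This is an illustrative reconstruction, not the authors' implementation: the names `preprocess` and `access` are assumptions, indices are 0-based, and the sorting step adds a logarithmic factor to the preprocessing.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate

def preprocess(relations):
    """relations[i] is a set of pairs over consecutive variables (v_i, v_{i+1})."""
    m = len(relations)
    # ext[i][a] = number of ways to extend v_i = a to values for v_{i+1}, ..., v_m
    ext = [dict() for _ in range(m + 1)]
    ext[m] = {b: 1 for _, b in relations[m - 1]}
    index = [None] * m
    for i in range(m - 1, -1, -1):
        successors = defaultdict(list)
        for a, b in set(relations[i]):
            if b in ext[i + 1]:  # keep only pairs that can be completed
                successors[a].append(b)
        index[i] = {}
        for a, bs in successors.items():
            bs.sort()
            sums = list(accumulate(ext[i + 1][b] for b in bs))
            index[i][a] = (bs, sums)  # sorted successors and prefix counts
            ext[i][a] = sums[-1]
    roots = sorted(ext[0])
    root_sums = list(accumulate(ext[0][a] for a in roots))
    return roots, root_sums, index

def access(structure, k):
    """Return the k-th answer (0-indexed) in lexicographic order of v_0, ..., v_m."""
    roots, root_sums, index = structure
    if not root_sums or not 0 <= k < root_sums[-1]:
        raise IndexError(k)
    values, sums = roots, root_sums
    answer = []
    for i in range(len(index) + 1):
        j = bisect_right(sums, k)  # first value whose prefix count exceeds k
        if j > 0:
            k -= sums[j - 1]
        answer.append(values[j])
        if i < len(index):
            values, sums = index[i][values[j]]
    return tuple(answer)

r1 = {(1, 2), (1, 3), (2, 3)}
r2 = {(2, 5), (3, 4), (3, 5), (9, 9)}
structure = preprocess([r1, r2])
print(access(structure, 2))  # (1, 3, 5)
```

Each access performs one binary search per variable, which is where the logarithmic access time comes from.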

We can use similar techniques for DA on more complex queries: acyclic queries

What about the order?

What happens if the order is given?

If the order is set, then we can measure how hard it is for DA

In the previous example:

\(xyztuv\) (path order): linear preprocessing time
\(yxztuv\) (or any order w/ an inversion): quadratic preprocessing time

Order complexity

For a query \(Q\) and an order \(\pi\), prior work defines a function \(\iota(Q, \pi)\) (incompatibility number - Bringmann, Carmeli, Mengel, 2022) that measures how hard this order is for DA over \(Q\).

We have:

an upper bound: an algorithm whose preprocessing time matches \(\iota(Q, \pi)\)
a lower bound: a reduction from a well-studied problem that is conjectured to be hard

Tractable Queries

We can define a tractability measure \(\iota\) such that:

For a query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables of complexity \(\iota(Q, \pi)\), we can solve DA tasks with:

\(\mathcal{O}(|\mathbb{D}|^{\iota(Q, \pi)}\mathsf{poly}(|Q|))\) preprocessing; and
\(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\) access time.

Carmeli, Tziavelis, Gatterbauer, Kimelfeld, Riedewald, 2020

Bringmann, Carmeli, Mengel, 2022

Signed Queries

Definition

Signed Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^{k} P_i(\vec{z_i}) \wedge \bigwedge_{j=1}^{\ell} \lnot N_j(\vec{z'_j})\)

Big difference:

positively encoding \(\lnot N(\vec{z})\) on a domain \(D\) requires \(|D|^{|\vec{z}|} - |N|\) tuples

\(N_i\)
\(x_1\)	\(x_2\)	\(x_3\)
0	1	0

\(\lnot N_i\)
\(x_1\)	\(x_2\)	\(x_3\)
0	0	0
0	0	1
0	1	1
1	0	0
1	0	1
1	1	0
1	1	1
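The blow-up can be checked directly on this example. A minimal sketch, assuming a hypothetical helper `complement` that builds the positive encoding of a negated atom by enumerating the whole domain product:

```python
from itertools import product

def complement(relation, domain, arity):
    """Positive encoding of a negated atom: all tuples over the domain not in the relation."""
    return sorted(set(product(sorted(domain), repeat=arity)) - set(relation))

n_i = {(0, 1, 0)}
not_n_i = complement(n_i, {0, 1}, 3)
print(len(not_n_i))  # 2**3 - 1 = 7
```

A single forbidden tuple already forces seven tuples in the positive encoding; the gap grows with the domain size and the arity.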

Tractability of SJQ

\(Q_1 = R(1, 2, 3) \land S(1, 2) \land T(2, 3) \land U(3, 1)\)

has linear preprocessing

\(Q_2 = S(1, 2) \land T(2, 3) \land U(3, 1)\)

non-linear preprocessing (triangle)

query should be as hard as its subquery

Tractability of SJQ

\(Q_1 = \lnot R(1, 2, 3) \land S(1, 2) \land T(2, 3) \land U(3, 1)\)

non-linear preprocessing (on a database where \(R\) is empty, \(\lnot R\) always holds and \(Q_1\) reduces to the triangle \(Q_2\))

\(Q_2 = S(1, 2) \land T(2, 3) \land U(3, 1)\)

non-linear preprocessing (triangle)

with negation, a query should be at least as hard as each of its subqueries

stricter tractability criteria

Preprocessing for an SJQ \(Q\) should cost at least as much as for any subquery \(Q' \subseteq Q\)

Good candidate for a stricter measure of query tractability: the signed hyperorder width (\(\mathsf{show}\))

\(Q = P \land N,\; \mathsf{show}(Q, \pi) = \max_{N'\subseteq N} f(P\land N', \pi)\)

Example queries

\(Q_1 = \lnot R(1, 2, 3) \land S(1, 2) \land T(2, 3) \land U(3, 1)\)

With the large negated atom: complexity \(2\)

\(Q_2 = \lnot R(1, 2, 3) \land T(2, 3) \land U(3, 1)\)

All subqueries have complexity \(1\)

Direct Access for tractable queries

Tractability of DA over SJQs can be expressed with \(\mathsf{show}\):

For a signed query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables, we can solve DA tasks with:

\(\mathcal{O}(|\mathbb{D}|^{\mathsf{show}(Q, \pi)}\mathsf{poly}(|Q|))\) preprocessing; and
\(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\) access time.

even with signed queries, we are able to build a good algorithm

Our Algorithm: A circuit approach

Relational Circuits

\(x_1\)	\(x_2\)	\(x_3\)
0	0	0
0	0	1
0	1	0
0	1	1
1	0	1
1	0	2
1	1	1
1	1	2
1	2	0
1	2	1
2	0	1
2	0	2
2	2	1
2	2	2

Ordered Relational Circuits

factorised representation of relations

circuit with 3 kinds of gates:

inputs: \(\top\) & \(\bot\)
decision gates
\(\times\)-gates

paths from decision gates are labelled by the domain values

+ order \(\prec\) on the variables

Ordered Relational Circuits

For \(C\) an ordered relational circuit, we can perform direct access tasks in time \(\mathcal{O}(\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\) after a preprocessing in time \(\mathcal{O}(|C|\cdot\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\)

Preprocessing

Idea: for each gate \(v\) over \(x_i\) and for each domain value \(d\), compute the size of the relation where \(x_i\) is set to a value \(d'\leqslant d\)

Direct Access

Compute the 13th solution \(\to\) 221, i.e. \((x_1, x_2, x_3) = (2, 2, 1)\)
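The preprocessing and access steps can be sketched on the example relation over \(x_1, x_2, x_3\). This is a simplified illustration, not the authors' implementation: it only uses decision gates and \(\top\)/\(\bot\) inputs (\(\times\)-gates are omitted), and the names `Decision`, `build`, `annotate` and `access` are assumptions. Indices are 0-based, so the 13th solution is `k = 12`.

```python
from bisect import bisect_right
from itertools import accumulate

TOP, BOT = object(), object()  # input gates

class Decision:
    """Decision gate on one variable: maps each domain value to a child gate."""
    def __init__(self, children):
        self.values = sorted(children)  # domain values in increasing order
        self.children = [children[d] for d in self.values]
        self.prefix = None  # filled by annotate()

def build(rows, cache=None):
    """Hypothetical helper: decision-only circuit for a set of equal-length tuples."""
    cache = {} if cache is None else cache
    key = frozenset(rows)
    if not key:
        return BOT
    if key == {()}:
        return TOP
    if key not in cache:  # equal sub-relations share one gate
        groups = {}
        for row in key:
            groups.setdefault(row[0], set()).add(row[1:])
        cache[key] = Decision({d: build(rest, cache) for d, rest in groups.items()})
    return cache[key]

def annotate(gate, memo=None):
    """Preprocessing: prefix[j] = number of tuples whose value is <= values[j]."""
    memo = {} if memo is None else memo
    if gate is TOP:
        return 1
    if gate is BOT:
        return 0
    if id(gate) not in memo:
        gate.prefix = list(accumulate(annotate(c, memo) for c in gate.children))
        memo[id(gate)] = gate.prefix[-1]
    return memo[id(gate)]

def access(root, k):
    """Top-down walk returning the k-th tuple (0-indexed); call annotate() first."""
    total = 1 if root is TOP else 0 if root is BOT else root.prefix[-1]
    if not 0 <= k < total:
        raise IndexError(k)
    gate, answer = root, []
    while gate is not TOP:
        j = bisect_right(gate.prefix, k)  # first value whose prefix count exceeds k
        if j > 0:
            k -= gate.prefix[j - 1]
        answer.append(gate.values[j])
        gate = gate.children[j]
    return tuple(answer)

rows = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 1), (1, 0, 2), (1, 1, 1),
        (1, 1, 2), (1, 2, 0), (1, 2, 1), (2, 0, 1), (2, 0, 2), (2, 2, 1), (2, 2, 2)]
root = build(rows)
total = annotate(root)
print(access(root, 12))  # (2, 2, 1)
```

The prefix sums stored on each gate are exactly the counts "where \(x_i\) is set to a value \(d' \leqslant d\)"; the walk then needs one binary search per variable.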

From JQ to circuit

\(Q\) an SJQ and \(x_1\prec\dots\prec x_n\) an order over the variable set

\(Q(\mathbb{D}) = \biguplus_{d\in D} Q[x_1 = d](\mathbb{D})\)

\[ \text{if} \begin{cases} Q & = & Q_1 \land Q_2 \\ \mathsf{var}(Q_1) \cap \mathsf{var}(Q_2) & = & \emptyset \end{cases} \]

then \(Q(\mathbb{D}) = Q_1(\mathbb{D}) \times Q_2(\mathbb{D})\)

recursive implementation + cache \(\implies\) ordered relational circuit computing \(Q(\mathbb{D})\)
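The two rules above (branch on the first variable, split independent subqueries) plus a cache can be sketched as follows. This is a simplified illustration for positive join queries only, not the authors' DPLL procedure for signed queries. The nested-tuple gate encoding and the names `compile_query` and `answers` are assumptions, and each atom is assumed to mention a variable at most once.

```python
from itertools import product

def compile_query(atoms, order, domain):
    """atoms: iterable of (variables, frozenset of tuples). Returns a nested-tuple circuit:
    ("top",), ("bot",), ("dec", var, {value: child}) or ("times", [children])."""
    cache = {}

    def components(state):
        # Group atoms that share variables (connected components of the query).
        comps = []
        for atom in state:
            linked = [c for c in comps if any(set(atom[0]) & set(a[0]) for a in c)]
            merged = [atom] + [a for c in linked for a in c]
            comps = [c for c in comps if all(c is not l for l in linked)] + [merged]
        return comps

    def branch(state, x, d):
        # Q[x = d]: select and project on x; None if some atom becomes empty.
        new_state = []
        for vs, rows in state:
            if x in vs:
                i = vs.index(x)
                rows = frozenset(r[:i] + r[i + 1:] for r in rows if r[i] == d)
                vs = vs[:i] + vs[i + 1:]
                if not rows:
                    return None
            if vs:  # fully assigned, satisfied atoms are dropped
                new_state.append((vs, rows))
        return new_state

    def rec(state):
        state = frozenset(state)
        if not state:
            return ("top",)
        if state not in cache:
            comps = components(state)
            if len(comps) > 1:  # disjoint variables: Cartesian product
                cache[state] = ("times", [rec(c) for c in comps])
            else:  # disjoint union over the values of the first remaining variable
                x = min((v for vs, _ in state for v in vs), key=order.index)
                children = {}
                for d in sorted(domain):
                    child = branch(state, x, d)
                    if child is not None:
                        children[d] = rec(child)
                cache[state] = ("dec", x, children) if children else ("bot",)
        return cache[state]

    return rec(atoms)

def answers(gate):
    """Enumerate the assignments (as dicts) represented by a gate."""
    if gate[0] == "top":
        yield {}
    elif gate[0] == "dec":
        for d, child in gate[2].items():
            for rest in answers(child):
                yield {gate[1]: d, **rest}
    elif gate[0] == "times":
        for parts in product(*(list(answers(c)) for c in gate[1])):
            merged = {}
            for part in parts:
                merged.update(part)
            yield merged

R = {(0, 1), (1, 1), (1, 2)}
S = {(1, 0), (2, 2)}
T = {(5,), (7,)}
atoms = [(("x", "y"), frozenset(R)), (("y", "z"), frozenset(S)), (("u",), frozenset(T))]
order = ["x", "y", "z", "u"]
circuit = compile_query(atoms, order, {0, 1, 2, 5, 7})
result = sorted(tuple(a[v] for v in order) for a in answers(circuit))
```

Here \(T(u)\) shares no variable with the rest of the query, so the root of the compiled circuit is a \(\times\)-gate; the cache ensures that identical residual subqueries are compiled once.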

Compiling Signed Queries

Let \(Q\) be an SJQ and \(\pi = x_n,\dots,x_1\) an order of complexity \(\mathsf{show}(Q, \pi)\) for \(Q\).

Exhaustive DPLL on \(Q\), \(\mathbb{D}\) and with order \(\pi\) returns an ordered circuit of size \(\mathcal{O}(\mathsf{poly}(|Q|)|\mathbb{D}|^{\mathsf{show}(Q, \pi)+1})\)

(Generalisation of [Capelli, 2017])

Recap

For a query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables of complexity \(\mathsf{show}(Q, \pi)\), we can solve DA tasks with a preprocessing in time \(\mathcal{O}(|\mathbb{D}|^{\mathsf{show}(Q, \pi) + 1}\mathsf{poly}(|Q|))\) and access in time \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\).

Algorithm schema:

compile an ordered relational circuit \(C\) computing \(Q(\mathbb{D})\);
annotate the gates with the number of solutions;
top-down induction to answer \(Q(\mathbb{D})[k]\).

Going Further

Generalising to conjunctive queries

This technique generalises to signed conjunctive queries (with \(\exists\) quantifiers):

project the \(\exists\)-quantified variables directly on the circuit
as long as they form a suffix of the variable order

Next steps

Going further with circuits

study the tractability of the circuit approach for DA on CQs with aggregation

\(Q(p, c, g, \mathsf{count()}) = \mathsf{Teams}(p, c) \land \mathsf{Games}(g, c, \cdot) \land \mathsf{Tries}(g, p)\)

How should we integrate the aggregation in the lexicographical order?

How does the aggregation fit into the compiled circuits?

\(\to\) (Eldar, Carmeli, Kimelfeld, 2023)

generalise the circuit approach to queries over annotated databases (FAQ and AJAR queries)

\(\to\) (Zhao, Fan, Ouyang, Koutris, 2023)

work on lower bounds for DA on SJQs

Mostly solved, will appear in a longer version

Shows that the tractability measure is the optimal choice

\(\to\) (Ongoing work with Nofar Carmeli)