Direct Access for Conjunctive Queries with Negation

Florent Capelli[1], Oliver Irwin[2]

21/03/2024 - SISE Seminar

[1] - CRIL / Université d’Artois

[2] - CRIStAL / Université de Lille

Direct Access

Context

Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^k R_i(\vec{z_i})\)

where \(\vec{z_i}\) is a tuple over \(X = \{x_1,\dots,x_n\}\)

Example: \(Q(city, country, name, id) = People(id, name, city) \wedge Capitals(city, country)\)

People
id name city
1 Alice Paris
2 Bob Lens
3 Chiara Rome
4 Djibril Berlin
5 Émile Dortmund
6 Francesca Rome
Capitals
city country
Berlin Germany
Paris France
Rome Italy
\(Q(\mathbb{D})\)
city country name id
Paris France Alice 1
Rome Italy Chiara 3
Berlin Germany Djibril 4
Rome Italy Francesca 6
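
As an illustration, the answer table above can be recomputed with a small hash join on the shared variable `city`. This is only a sketch: the function name `join_people_capitals` and the tuple layout are assumptions made for the example, not part of the talk.

```python
# Relation contents copied from the People and Capitals tables above.
people = [
    (1, "Alice", "Paris"),
    (2, "Bob", "Lens"),
    (3, "Chiara", "Rome"),
    (4, "Djibril", "Berlin"),
    (5, "Émile", "Dortmund"),
    (6, "Francesca", "Rome"),
]
capitals = [("Berlin", "Germany"), ("Paris", "France"), ("Rome", "Italy")]


def join_people_capitals(people, capitals):
    """Evaluate Q(city, country, name, id) with a hash join on `city`."""
    countries_of = {}
    for city, country in capitals:
        countries_of.setdefault(city, []).append(country)
    return [
        (city, country, name, pid)
        for pid, name, city in people
        for country in countries_of.get(city, [])  # no match: tuple is dropped
    ]


answers = join_people_capitals(people, capitals)
for row in answers:
    print(*row)
```

Bob (Lens) and Émile (Dortmund) disappear because their cities have no matching tuple in Capitals.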

Direct Access

We want to access the \(k\)-th element of \(Q(\mathbb{D})\) for a given order.

Make \(Q(\mathbb{D})\) an array, sort it and then we have direct access?

\(Q(\mathbb{D})\)
city country name id
Berlin Germany Djibril 4
Paris France Alice 1
Rome Italy Chiara 3
Rome Italy Francesca 6

\(Q(\mathbb{D})[4] = (Rome, Italy, Francesca, 6)\)
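
The naive approach on this slide can be sketched as follows: materialise the answers, sort them once, then index. The helper name `naive_direct_access` is hypothetical, and indices are taken to be 1-based to match \(Q(\mathbb{D})[4]\) above.

```python
answers = [
    ("Paris", "France", "Alice", 1),
    ("Rome", "Italy", "Chiara", 3),
    ("Berlin", "Germany", "Djibril", 4),
    ("Rome", "Italy", "Francesca", 6),
]


def naive_direct_access(answers):
    """Sort the materialised answers once and return a 1-based accessor."""
    ordered = sorted(answers)  # lexicographic order on (city, country, name, id)

    def access(k):
        if not 1 <= k <= len(ordered):
            raise IndexError(k)
        return ordered[k - 1]

    return access


access = naive_direct_access(answers)
fourth = access(4)
print(fourth)
```

Access is a single array lookup, but the sketch only works because the whole answer set fits in memory.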

\(Q(\mathbb{D})\)
city country name id
\(\dots\) \(\dots\) \(\dots\) \(\dots\)
Berlin Germany Djibril 4
\(\dots\) \(\dots\) \(\dots\) \(\dots\)
Paris France Alice 1
\(\dots\) \(\dots\) \(\dots\) \(\dots\)
Rome Italy Chiara 3
Rome Italy Francesca 6
\(\dots\) \(\dots\) \(\dots\) \(\dots\)

\(Q(\mathbb{D})[1432] =\) ??

Precomputation: very costly (all of \(Q(\mathbb{D})\) is materialised and sorted)

Access: nearly free

We need another way to represent the data

Applications

Uniform Sampling (w/o repetition)

gives a good idea of what the dataset is like statistically

Answer Enumeration

by accessing every answer in order
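
Both applications only need the number of answers and an `access(k)` primitive. A minimal sketch, assuming a 1-based `access` function; the sorted list used here is just a stand-in for a real direct access structure, and the helper names are made up for the example.

```python
import random


def sample_without_repetition(access, n, m, rng=random):
    """Draw m distinct answers uniformly at random using only direct access."""
    indices = rng.sample(range(1, n + 1), m)  # m distinct 1-based indices
    return [access(k) for k in indices]


def enumerate_in_order(access, n):
    """Enumerate every answer by accessing indices 1..n in turn."""
    for k in range(1, n + 1):
        yield access(k)


# Stand-in for a direct access structure (assumption for the demo).
sorted_answers = [
    ("Berlin", "Germany", "Djibril", 4),
    ("Paris", "France", "Alice", 1),
    ("Rome", "Italy", "Chiara", 3),
    ("Rome", "Italy", "Francesca", 6),
]


def access(k):
    return sorted_answers[k - 1]


sample = sample_without_repetition(access, len(sorted_answers), 2, random.Random(0))
everything = list(enumerate_in_order(access, len(sorted_answers)))
```

Sampling indices rather than answers is what keeps the sample uniform and free of repetitions without ever listing \(Q(\mathbb{D})\).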

Unifies existing results

Tractable Join Queries

Complexity of Direct Access

In the general case, DA is NP-hard

We need to know if there is a solution:

NP-hard (Chandra, Merlin, 1977)

We need to know how many solutions exist:

#P-hard

Are there cases where the problem is tractable?

A tractable example

\(Q = R_1(x,y) \wedge R_2(y,z) \wedge R_3(z, t) \wedge R_4(t, u) \wedge R_5(u,v)\)

Array (worst-case) size: \(\mathcal{O}(|\mathbb{D}|^5)\)

Path order (\(x y z t u v\)): \(\mathcal{O}(|\mathbb{D}|)\)

We can use similar techniques for DA on more complex queries: acyclic queries
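
To make the path-order claim concrete, here is a sketch of direct access on a path query without materialising the answers. It assumes binary relations given as sets of pairs and the path order on the variables; `preprocess` and `access` are hypothetical names. Each value stores how many answers extend it, and access walks the path with binary searches over cumulative counts.

```python
from bisect import bisect_right
from itertools import accumulate


def preprocess(relations):
    """relations[i] is a set of pairs linking path variable i to variable i+1."""
    m = len(relations)
    ext = [dict() for _ in range(m)]   # ext[i][a]: number of completions of a at position i
    succ = [dict() for _ in range(m)]  # succ[i][a]: (sorted successors, cumulative counts)
    for i in range(m - 1, -1, -1):     # right to left along the path
        by_source = {}
        for a, b in relations[i]:
            c = 1 if i == m - 1 else ext[i + 1].get(b, 0)
            if c:                      # skip dangling tuples
                by_source.setdefault(a, []).append((b, c))
        for a, pairs in by_source.items():
            pairs.sort()
            cumul = list(accumulate(c for _, c in pairs))
            succ[i][a] = ([b for b, _ in pairs], cumul)
            ext[i][a] = cumul[-1]
    roots = sorted(ext[0])
    return roots, list(accumulate(ext[0][a] for a in roots)), succ


def access(index, k):
    """k-th answer (1-based) in lexicographic order along the path."""
    values, cumul, succ = index
    if not cumul or not 1 <= k <= cumul[-1]:
        raise IndexError(k)
    answer = []
    for i in range(len(succ) + 1):
        j = bisect_right(cumul, k - 1)  # first j with cumul[j] >= k
        k -= cumul[j - 1] if j else 0   # skip the answers of smaller values
        answer.append(values[j])
        if i < len(succ):
            values, cumul = succ[i][values[j]]
    return tuple(answer)


R1 = {(1, 1), (1, 2), (2, 1)}
R2 = {(1, 5), (2, 5), (2, 6)}
index = preprocess([R1, R2])
third = access(index, 3)
```

The stored data has size proportional to the relations (plus sorting), while the answer set itself can be much larger.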

What about the order?

What happens if the order is given?

If the order is set, then we can measure its complexity

In the previous example:

  • \(xyztuv\) (path order): linear complexity
  • \(yxztuv\) (or any order w/ an inversion): quadratic complexity

Order complexity

For a query \(Q\) and an order \(\pi\), we have a function \(f(\pi, Q)\) that computes the complexity of the order for \(Q\).

We have:

  • an upper bound: an algorithm that achieves this complexity
  • a lower bound: a reduction from a problem conjectured to be hard

Tractable Queries

For a query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables of complexity \(f(\pi)\), we can solve DA tasks with:

  • \(\mathcal{O}(|\mathbb{D}|^{f(\pi)}\mathsf{poly}(|Q|))\) preprocessing; and
  • \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\) access time.

Signed Queries

Definition

Signed Join Query: \(Q(x_1, \dots, x_n) = \bigwedge_{i=1}^{k} P_i(\vec{z_i}) \wedge \bigwedge_{j=1}^{l} \lnot N_j(\vec{z'_j})\)

Big difference:

positively encoding \(\lnot N(\vec{z})\) on a domain \(D\) requires \(|D|^{|\vec{z}|} - \#N\) tuples (below: \(2^3 - 1 = 7\))

\(N_i\)
\(x_1\) \(x_2\) \(x_3\)
0 1 0
\(\lnot N_i\)
\(x_1\) \(x_2\) \(x_3\)
0 0 0
0 0 1
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
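
The blow-up can be checked on the example above with a few lines. This sketch assumes the Boolean domain \(\{0, 1\}\) used in the tables; the helper `complement` is a name chosen for the illustration.

```python
from itertools import product


def complement(relation, domain, arity):
    """All tuples over `domain` of the given arity that are NOT in `relation`."""
    return sorted(set(product(domain, repeat=arity)) - set(relation))


N = {(0, 1, 0)}
not_N = complement(N, (0, 1), 3)
for row in not_N:
    print(*row)
```

A single forbidden tuple already costs seven positive tuples here, and the gap grows as \(|D|^{|\vec{z}|}\) with the domain size and the arity.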

Tractability of NJQ

[Figure: two example queries, one tractable and one not tractable]

this measure does not suffice as it is not monotone

stricter acyclicity

Good candidate for a stricter measure of query tractability:

take into account every subquery and take the worst possible complexity

Example queries

With only positive atoms: \(1\)

With negative big atom: \(2\)

All subqueries have complexity \(1\)

Direct Access for tractable queries

For a query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables of complexity \(f(\pi)\), we can solve DA tasks with:

  • \(\mathcal{O}(|\mathbb{D}|^{f(\pi)}\mathsf{poly}(|Q|))\) preprocessing; and
  • \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\) access time.

even with signed queries, we are able to build a good algorithm

A Circuit Approach to Direct Access

Relational Circuits

\(x_1\) \(x_2\) \(x_3\)
0 0 0
0 0 1
0 1 0
0 1 1
1 0 1
1 0 2
1 1 1
1 1 2
1 2 0
1 2 1
2 0 1
2 0 2
2 2 1
2 2 2

Ordered Relational Circuits

factorised representation of relations

circuit with 3 kinds of gates:

  • inputs: \(\top\) & \(\bot\)
  • decision gates
  • \(\times\)-gates

paths from decision gates are labelled by the domain values

+ order \(\prec\) on the variables

For \(C\) an ordered relational circuit, we can perform direct access tasks in time \(\mathcal{O}(\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\) after a preprocessing in time \(\mathcal{O}(|C|\cdot\mathsf{poly}(|X|)\mathsf{polylog}(|D|))\)

Preprocessing

Idea: for each gate \(v\) over \(x_i\) and for each domain value \(d\)

compute the size of the relation where \(x_i\) is set to a value \(d'\leqslant d\)

Direct Access

Compute the 13th solution \(\to (2, 2, 1)\)
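
The preprocessing and the access can be sketched on the 14-tuple relation from the Relational Circuits slide. To stay short, this sketch assumes a simplified circuit with decision gates only (a trie, with no \(\times\)-gates and no sharing), so it illustrates the counting idea rather than the factorised representation; the names `Decision`, `build` and `access` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

TOP = None  # accepting input gate


def count(gate):
    """Number of tuples represented below a gate."""
    return 1 if gate is TOP else gate.cumul[-1]


class Decision:
    """Decision gate on one variable, with one child per domain value."""

    def __init__(self, children):
        self.values = sorted(children)
        self.children = [children[v] for v in self.values]
        # Preprocessing: cumul[j] = number of tuples with value <= values[j] here.
        self.cumul = list(accumulate(count(c) for c in self.children))


def build(tuples):
    """Build the gates for a non-empty list of tuples of equal length."""
    if not tuples[0]:
        return TOP
    groups = {}
    for t in tuples:
        groups.setdefault(t[0], []).append(t[1:])
    return Decision({v: build(rest) for v, rest in groups.items()})


def access(gate, k):
    """k-th tuple (1-based) in lexicographic order, by top-down descent."""
    if not 1 <= k <= count(gate):
        raise IndexError(k)
    answer = []
    while gate is not TOP:
        j = bisect_right(gate.cumul, k - 1)  # first j with cumul[j] >= k
        k -= gate.cumul[j - 1] if j else 0   # skip tuples with smaller values
        answer.append(gate.values[j])
        gate = gate.children[j]
    return tuple(answer)


relation = [
    (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 1), (1, 0, 2), (1, 1, 1),
    (1, 1, 2), (1, 2, 0), (1, 2, 1), (2, 0, 1), (2, 0, 2), (2, 2, 1), (2, 2, 2),
]
root = build(relation)
thirteenth = access(root, 13)
print(thirteenth)
```

At the root the cumulative counts are 4, 10 and 14, so index 13 falls under \(x_1 = 2\) with residual index 3, and the descent continues in the same way on \(x_2\) and \(x_3\).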

From CQ to circuit

\(Q\) a CQ and \(x_1\prec\dots\prec x_n\) an order over the variable set

\(Q(\mathbb{D}) = \biguplus_{d\in D} Q[x_1 = d](\mathbb{D})\)

\[ \text{if} \begin{cases} Q & = & Q_1 \land Q_2 \\ \mathsf{var}(Q_1) \cap \mathsf{var}(Q_2) & = & \emptyset \end{cases} \]

then \(Q(\mathbb{D}) = Q_1(\mathbb{D}) \times Q_2(\mathbb{D})\)

recursive implementation + cache \(\implies\) ordered relational circuit computing \(Q(\mathbb{D})\)
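
The two rules above can be sketched as a cached recursion. For brevity this version only counts \(|Q(\mathbb{D})|\) instead of emitting gates: each cached call would correspond to one gate, the sum to a decision gate and the product to a \(\times\)-gate. It assumes a positive join query in which every variable occurs in some atom and no atom repeats a variable; `restrict`, `split` and `count` are names chosen for the illustration.

```python
from functools import lru_cache


def restrict(atom, var, val):
    """Atom after fixing var = val: keep matching rows and drop that column."""
    vars_, rows = atom
    if var not in vars_:
        return atom
    i = vars_.index(var)
    return (vars_[:i] + vars_[i + 1:],
            frozenset(r[:i] + r[i + 1:] for r in rows if r[i] == val))


def split(atoms):
    """Group atoms into subqueries that share no variable."""
    groups = []  # pairs (variables, atoms), pairwise variable-disjoint
    for atom in atoms:
        vs, members, rest = set(atom[0]), [atom], []
        for gvs, gatoms in groups:
            if gvs & vs:
                vs |= gvs
                members += gatoms
            else:
                rest.append((gvs, gatoms))
        groups = rest + [(vs, members)]
    return [frozenset(g) for _, g in groups]


@lru_cache(maxsize=None)  # the cache is what lets subqueries be shared
def count(atoms, order):
    if any(not rows for _, rows in atoms):
        return 0                                 # some atom cannot be satisfied
    atoms = frozenset(a for a in atoms if a[0])  # fully assigned atoms are satisfied
    if not atoms:
        return 1
    parts = split(atoms)
    if len(parts) > 1:                           # product rule
        total = 1
        for part in parts:
            total *= count(part, order)
        return total
    x = next(v for v in order if any(v in a[0] for a in atoms))
    vars_, rows = next(a for a in atoms if x in a[0])
    candidates = {r[vars_.index(x)] for r in rows}
    return sum(count(frozenset(restrict(a, x, d) for a in atoms), order)
               for d in candidates)              # disjoint union rule


people = frozenset({(1, "Alice", "Paris"), (2, "Bob", "Lens"), (3, "Chiara", "Rome"),
                    (4, "Djibril", "Berlin"), (5, "Émile", "Dortmund"),
                    (6, "Francesca", "Rome")})
capitals = frozenset({("Berlin", "Germany"), ("Paris", "France"), ("Rome", "Italy")})
query = frozenset({(("id", "name", "city"), people), (("city", "country"), capitals)})
n_answers = count(query, ("city", "country", "name", "id"))
```

On the People/Capitals example, fixing `city` disconnects the two atoms, so the product rule applies immediately below the first decision.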

Compiling Signed Queries

Let \(Q\) be an SJQ and \(\pi = x_n,\dots,x_1\) an order of complexity \(f(\pi)\) for \(Q\).

Exhaustive DPLL on \(Q\), \(\mathbb{D}\) and with order \(\pi\) returns an ordered circuit of size \(\mathcal{O}(\mathsf{poly}(|Q|)|\mathbb{D}|^{f(\pi)+1})\).

(Generalisation of [Capelli, 2017])

Recap

For a query \(Q(x_1,\dots,x_n)\) and an order \(\pi\) on the variables of complexity \(f(\pi)\), we can solve DA tasks with a preprocessing in time \(\mathcal{O}(|\mathbb{D}|^{f(\pi)}\mathsf{poly}(|Q|))\) and access in time \(\mathcal{O}(\mathsf{poly}(|Q|)\mathsf{polylog}(|\mathbb{D}|))\).

Algorithm schema:

  1. compile an ordered relational circuit \(C\) computing \(Q(\mathbb{D})\);
  2. annotate the gates with the number of solutions;
  3. top-down induction to answer \(Q(\mathbb{D})[k]\).

Going Further

Generalising to conjunctive queries

This technique generalises to:

  1. conjunctive (with \(\exists\) quantifiers) signed queries:
    • project \(\exists\) directly on the circuit
    • as long as the projection is on a suffix

Next steps

Going further with circuits

study the tractability of the circuit approach for DA on CQs with aggregation

\(Q(p, c, g, \mathsf{count()}) = \mathsf{Teams}(p, c) \land \mathsf{Games}(g, c, \cdot) \land \mathsf{Tries}(g, p)\)

How should we integrate the aggregation in the lexicographical order?

How does the aggregation fit into the compiled circuits?

\(\to\) (Eldar, Carmeli, Kimelfeld, 2023)

generalise the circuit approach to queries over annotated databases (FAQ and AJAR queries)

\(\to\) (Zhao, Fan, Ouyang, Koutris, 2023)