Consider a query with only natural joins such as the SQL query below,
SELECT * FROM r1 JOIN r2 JOIN r3 JOIN r4
$\Leftrightarrow \quad R(a, b, c, d, e) = R_1(a, b), R_2(b, c), R_3(c, d), R_4(d, e)$
By regarding each relation, such as $R_1(a, b)$, as a vertex and establishing an edge between vertices if they share some common variables, such a edge between $R_1(a, b), R_2(b, c)$, we find a graph representation of the query. The example above admits a chain of 4 vertices.
\[R_1 - R_2 - R_3 - R_4\]In addition to the chain, we have also,
Thomas Neumann’s paper evaluates two existing Dynamic Programming algorithms that look for the optimal join tree (query plan) by explioting the graph presentation with the focus on the # of (sub-)query plans enumerated.
The authors first introduce two key concepts (acronyms),
Considering the 4-relation chain query above, $R_1(a, b), R_2(b, c)$ is a csg, but $R_1(a, b), R_3(c, d)$ is not.
A ccp is a pair of csgs that cover all vertices without sharing any in common.
For example,
Now, the # of (sub-)query plans enumerated is exactly the # of ccps enumerated.
The first algorithm $DPsize$ builds the following DP table in a bottom-up manner. Given the relations, it find the optimal. The algo derives the optimal query plan of the sub-query highlighted in red, $R_2,R_3,R_4$, by checking its sub-queries highlighted in pink below, such as $R_3,R_4$ and $R_2$. The sub-queries in grey, like $R_2,R_4$, are not csg, for they are not connected. The algo avoids such cartesian products.
$R_1,R_2,R_3,R_4$ | |||||
$R_1,R_2,R_3$ | $R_2,R_3,R_4$ | $R_3,R_4,R_1$ | |||
$R_1,R_2$ | $R_2,R_3$ | $R_3,R_4$ | $R_4,R_1$ | $R_1,R_3$ | $R_2,R_4$ |
$R_1$ | $R_2$ | $R_3$ | $R_4$ |
The second algorithm $DPsub$ uses a bitvector of size $n$ to represent the queries contained in a subquery and builds the DP from left to right.
0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 | … |
---|---|---|---|---|---|---|---|
$R_1$ | $R_2$ | $R_1, R_2$ | $R_3$ | $R_1, R_3$ | $R_2, R_3$ | $R_1, R_2, R_3$ | … |
The authors conclude that $DPsub$ works better when the search space is dense, such as for clique queries, where most partitions of the graph tend to be connected, thus csg.
$DPsize$ works better when the searching space is sparse, such as for chain queries.
The authors propose a universal solution $DPccp$ superior to $DPsize$ and $DPsub$ by enumerating only the valid ccps without iterating over the duplicates or cartesian (cross) products.
$DPccp$ requires a breath-first numbering of all vertices (relations) before it explores the neighborhood of each vertex following a descending order their BFS indices. The neighborhood of a vertex is limited to itself and the vertices with larger BFS indices.
Let’s bring up once more the chain query example in the beginning, $R(a, b, c, d, e) = R_1(a, b), R_2(b, c), R_3(c, d), R_4(d, e)$.
We rename the relations with breath-first numbering starting from the original $R_2(b, c)$ with the resulting query shown below.
\[R(a, b, c, d, e) = R'_2(a, b), R'_1(b, c), R'_3(c, d), R'_4(d, e)\]For simplicity, we write $R_i’$ as $R_i$. The chain graph is now as follows.
\[R_2 - R_1 - R_3 - R_4\]$DPccp$ first the neighborhood of the vertex with the largest BF number, namely $R_4$, while masking out all vertices with smaller BF numbers, namely $R_1, R_2$ and $R_3$.
\[{\color{grey}R_2 - R_1 - R_3 - } {\color{red}R_4} \quad\quad \text{Output:}\, \{R_4\}\]$DPccp$ moves on to explore $R_3$, while masking out $R_1, R_2$. In addition to the singleton set {R_3}, it also ouputs an $R_3$’s neighborhood ${R_3, R_4}$.
\[{\color{grey}R_2 - R_1 -} {\color{red}R_3} - R_4 \quad\quad \text{Output:}\, \{R_3\}, \{R_3, R_4\}\]When $DPccp$ explores $R_2$, while masking out $R_1$. Though visible at this moment, neither $R_3$ or $R_4$ is connected to $R_2$.
\[{\color{red}R_2} {\color{grey} - R_1 -} R_3 - R_4 \quad\quad \text{Output:}\, \{R_2\}\]Finally, $DPccp$ reaches $R_1$ with everything in the sight range. The neighborhoods are explored in a depth-first fashion, resulting in an ascending order of their sizes.
\[R_2 - {\color{red}R_1} - R_3 - R_4 \quad\quad \text{Output:}\, \{R_1\}, \{R_1, R_2\}, \{R_1, R_2, R_3\}, \{R_1, R_3\}, \{R_1, R_3, R_4\}\]We may quickly notice two things that $DPccp$ does very well.