5.5Natural Join
T he next operation we will look at is the Natural Join (hereafter referred to simply as Join). This operation takes two source relations as inputs and produces a relation whose tuples are formed by concatenating tuples from each input source. It is basically a cartesian product of the extensions of each input source. However, not all possible combinations of tuples necessarily end up in the result. This is because it implicitly selects from among all possible tuple combinations only those that have identical values in attributes shared by both relations.
Figure 5-6 The Join combines two relations over one or more common domains
Thus, in a typical application of a Join, the intensions of the input sources share at least one attribute name or domain (we assume here that attribute names are global to a schema, ie. the same name occurring in different relations denote the same attribute and value domain). The Join is said to occur over such domain(s). Figure 5 -6 illustrates the general effect. The shaded left-most two columns of the inputs are notionally the shared attributes. The result comprise these and the concatenation of the other columns from each input. More precisely, if the degree of the input sources were m and n respectively, and the number of shared attributes was s, then the degree of the resultant relation is (m+n–s).
As an example, consider the two relations below:
Transaction
|
C#
|
P#
|
Date
|
Qnt
|
1
|
1
|
21.01
|
20
|
1
|
2
|
23.01
|
30
|
2
|
1
|
26.01
|
25
|
2
|
2
|
29.01
|
20
|
-
Customer
|
C#
|
Cname
|
Ccity
|
Cphone
|
1
|
Codd
|
London
|
2263035
|
2
|
Martin
|
Paris
|
5555910
|
3
|
Deen
|
London
|
2234391
|
Shared attribute
(s = 1)
These relations share the attribute ‘C#’, as indicated. To compute the join of these relations, consider in turn every possible pair of tuples formed by taking one tuple from each relation, and examine the values of their shared attribute. So if the pair under consideration was
1, Codd, London, 2263035 and 1, 1, 21.01, 20
we would find that the values match exactly. In such a case, we concatenate them and add the concatenation to the resultant relation. It doesn’t matter if the second tuple is concatenated to the end of the first, or the first to the second, as long as we are consistent about it. By convention, we use the former. Additionally, we omit the second occurrence of the shared attribute in the result (repeated occurrence is superfluous). This gives us the tuple
1, Codd, London, 2263035, 1, 21.01, 20
If, on the other hand, the pair under consideration was
3, Deen, London, 2234391 and 1, 1, 21.01, 20
we would ignore it because the values of their shared attributes do not match exactly.
Thus, the resultant relation after considering all pairs would be:
-
Result
|
C#
|
Cname
|
Ccity
|
Cphone
|
P#
|
Date
|
Qnt
|
1
|
Codd
|
London
|
2263035
|
1
|
21.01
|
20
|
1
|
Codd
|
London
|
2263035
|
2
|
23.01
|
30
|
2
|
Martin
|
Paris
|
5555910
|
1
|
26.01
|
25
|
2
|
Martin
|
Paris
|
5555910
|
2
|
29.01
|
20
|
The foregoing description is in fact general enough to admit operations on relations that do not share any attributes at all (s = 0). The join, in such a case, is simply the cartesian product of the input sources’ extensions (the condition that tuple combinations have identical values over shared attributes is vacuously true since there are no shared attributes!). However, such uses of the operation are atypical.
Syntactically, we will write the Join operation as follows:
join 1 AND 2
over
giving
where again
-
I is a valid relation name (in the schema or the result of a previous operation)
-
is a comma-separated non-empty list of attribute names, each of which must occur in both input sources, and
-
is a unique identifier denoting the resultant relation
With this syntax, particularly with the over-clause, we have in fact taken the liberty
-
to insist that the join must be over at least one shared attribute, ie. we disallow expressions of pure cartesian products of two relations that do not share any attribute. This restriction is of no practical consequence, however, as in practice a Join is used to bring together information from different relations related through some common value.
-
to allow a join over a subset of shared attributes, ie. we relax (generalise) the restriction that a Join is over all shared attributes.
If a Join is over a proper subset of shared attributes, then shared attributes not specified in the over-clause will each have its own column in the result relation. But in such cases, the respective column labels will be qualified names. We will adopt the convention of writing a qualified name as ‘.’, where is the column label and the relation name in which appears. As an illustration, consider the relations below:
-
R1
|
|
|
|
R2
|
|
|
A1
|
A2
|
X
|
|
A1
|
A2
|
Y
|
1
|
2
|
abc
|
|
2
|
3
|
pqr
|
1
|
3
|
def
|
|
2
|
2
|
xyz
|
2
|
4
|
ijk
|
|
|
|
|
The operation
join R1 AND R2 over A1 giving Result
will yield
-
Result
|
|
|
|
|
A1
|
R1.A2
|
X
|
R2.A2
|
Y
|
2
|
4
|
ijk
|
3
|
pqr
|
2
|
4
|
ijk
|
2
|
xyz
|
To see why Join is a necessary operation in the algebra, consider the following situation (assume as context the Customer and Transaction relations above): the company decided that customers who purchased product number 1 (P# = 1) should be informed that a fault has been discovered in the product and that, as a sign of good faith and of how it values its customers, it will replace the product with a brand new fault-free one. To do this, we need to list, therefore, the names and phone numbers of all such customers.
First, we need to identify all customers who purchased product number 1. This information is in the Transaction relation and, using the following selection operation, it is easy to limit its extension to only such customers:
Next, we note that the resultant relation only identifies such customers by their customer numbers. What we need, though, are their names and phone numbers. In other words, we would like to extend each tuple in A with the customer name and phone number corresponding to the customer number. As such items are found in the relation Customer which shares the attribute ‘C#’ with A, the join is a natural operation to perform:
With B, we have practically derived the information we need—in fact, more than we need, since we are interested only in the customer name (the ‘Cname’ column) and phone number (the ‘Cphone’ column). But as we’ve learned, the irrelevant columns may be easily removed using projection, as shown below.
As a final example, let us also assume we have the Product relation, in addition to the Customer and Transaction relations:
-
Product
|
|
|
P#
|
Pname
|
Pprice
|
1
|
CPU
|
1000
|
2
|
VDU
|
1200
|
The task is to “get the names of products sold to customers in London”. Once again, this task will require a combination of operations which must involve a Join at some point because not all the information required are contained in one relation. The sequence of operations required is shown below.
-
Customer
|
C#
|
Cname
|
Ccity
|
Cphone
|
1
|
Codd
|
London
|
2263035
|
2
|
Martin
|
Paris
|
5555910
|
3
|
Deen
|
London
|
2234391
|
select Customer where Ccity = “London” giving A
-
A
|
|
|
|
|
Transaction
|
|
|
C#
|
Cname
|
Ccity
|
Cphone
|
|
C#
|
P#
|
Date
|
Qnt
|
1
|
Codd
|
London
|
2263035
|
|
1
|
1
|
21.01
|
20
|
3
|
Deen
|
London
|
2234391
|
|
1
|
2
|
23.01
|
30
|
|
|
|
|
|
2
|
1
|
26.01
|
25
|
|
|
|
|
|
2
|
2
|
29.01
|
20
|
join A AND Transaction over C# giving B
-
B
|
|
|
|
|
|
|
|
Product
|
|
C#
|
Cname
|
Ccity
|
...
|
P#
|
Date
|
Qnt
|
|
P#
|
Pname
|
Pprice
|
1
|
Codd
|
London
|
...
|
1
|
21.01
|
20
|
|
1
|
CPU
|
1000
|
1
|
Codd
|
London
|
...
|
2
|
23.01
|
30
|
|
2
|
VDU
|
1200
|
join B AND Product over P# giving C
-
C
|
|
|
|
|
|
|
|
|
C#
|
Cname
|
Ccity
|
Cphone
|
P#
|
Date
|
Qnt
|
Pname
|
Pprice
|
1
|
Codd
|
London
|
2263035
|
1
|
21.01
|
20
|
CPU
|
1000
|
1
|
Codd
|
London
|
2263035
|
2
|
23.01
|
30
|
VDU
|
1200
|
project C over Pname giving Result
-
Formal Definition
As before, if denotes a relation, then let
S() denote the finite set of attribute names of (ie. its intension)
T() denote the finite set of tuples of (ie. its extension)
, where T() and S(), denote the value of attribute in tuple
Further, if 1 and 2 are tuples, let 1^2 denote the tuple resulting from appending 2 to the end of 1.
We will also have need to use the terminology introduced in defining projection above, in particular, Stuple and the definition:
R(, , ’) Stuple() Stuple(’)
The (natural) join operation takes the form
join AND over giving
As with other operations, the input sources and must denote valid relations that are either defined in the schema or are results of previous operations, and must be a unique identifier to denote the result of the join. is a tuple of attribute names such that:
Stuple() (S() S())
Let = (S() S()) – Stuple(), ie. the set of shared attribute names not specified in the over-clause. We next define, for any relation r:
Rename(r, ) { | S(r) – ( = ‘r.p’ p S(r) ) }
In the case that = {} or S(r) = {}, Rename(r, ) = S(r).
The Join operation can then be characterised by the following:
-
S() Rename(,) Rename(,)
-
T() { 1^2 | 1 T() T() R(, , 2)
Stuple() 1 = }
where
Stuple() = S() – Stuple()
Share with your friends: |