Each relational operation entails a certain amount of work: retrieving a tuple, examining a tuple’s attribute values, comparing attribute values, creating new tuples, repeating a process on each tuple in a relation, etc. For a given operation, the amount of work clearly varies with the cardinality of source relation(s). For example, a selection performed on a relation twice the cardinality of another (of the same degree) would involve twice as much work.
We can also compare the relative amount of work needed between different operations based on the number of tuples processed. An operation with two source inputs, for example, need to repeat its logic on every possible tuple-pair formed by taking a tuple from each input relation. Thus if we had two relations of cardinalities M and N respectively, a total of MN tuple-pairs must be processed, ie. M (or N) times more than, say, a selection operation on each individual relation. Of course, this is not an exact relative measure of work, as there are also differences in the amount of work expended by different operations at the tuple level. By and large, however, we are interested in the order of magnitude of work (rather than the exact amount of work) and this is fairly well approximated by the number of tuples processed.
We will call such a measure the efficiency of an operation. Thus, the efficiency of selection and projection is the cardinality of its single input relation, while the efficiency of join, divide and set operations is the product of the respective cardinalities of their two input relations.
Why should the efficiency of operations interest us? Consider the following sequence of operations:
join Customer AND Transaction over C# giving X;
select X where CCity = “London” giving Result
Suppose the cardinality of Customer was 100 and that of Transaction was 1000. Then the efficiency of the join operation is 1001000 = 100000. The cardinality of X is 1000 (as it is certainly intended that the C# in every Transaction tuple matches a C# in one of the Customer tuples). Therefore, the efficiency of the selection is 1000. As these two operations are performed one after another, the efficiency of the entire sequence of operations is naturally the sum of their individual efficiencies, ie. 100000+1000 = 101000.
Now consider the following sequence:
select Customer where CCity = “London” giving X;
join X AND Transaction over C# giving Result
The reader can verify that this sequence is relationally equivalent to the first, ie. they produce identical results. But how does its efficiency compare with that of the first? Let us calculate using the same assumptions about the cardinalities. The efficiency of the selection is 100. To estimate the efficiency of the join, we need to make an assumption on the cardinality of X. Let’s say that 10 customers live in London. Then the efficiency of the join is 101000 = 10000, and the efficiency of the sequence as a whole is 100+10000 = 10100—ten times more efficient than the first!
Of course, the reader may think that the assumption about X’s cardinality was contrived to give this dramatic performance improvement. The point, however, is that the second sequence can do no worse than the first, ie. if all customers in the Customer relation live in London, then it performs as poorly as the first. More likely, however, we expect a performance improvement.
The above example illustrates a very important point about relational algebra: there can be more than one (sequence of) expression that describe a desired result. The main aim of optimisation, therefore, is to translate a given (sequence of) expression into its most efficient equivalent form. Such optimisation may be done manually by a human user or automatically by the database management system. Automatic optimisation may in fact do better because the automatic optimiser has access to information that is not readily available to a human optimiser, eg. current cardinalities of source relations, current data values, etc. But the overwhelming majority of relational DBMS’s available today merely execute operations requested by users as is. Thus, it is important that users know how to perform optimisations manually.
F or manual optimisation, it is perhaps less important to derive the most efficient form of a query than to follow certain guidelines, heuristics or rules-of-thumb that lead to more efficient expressions. Frequently the latter will lead to acceptable performance and expending more effort to find the optimal expression may not significantly improve that performance if good heuristics are used. There is, in fact, a simple and effective rule to remember when writing queries: delay as long as possible the use of expensive operations! In particular, we should wherever possible put selection ahead of other operations because it reduces the cardinality of relations. Figure 6 -11 illustrate the application of this principle. The reader should be able to verify that the two sequences of operations are logically equivalent and that intuitively the selection operations before the joins can significantly improve the efficiency of the query.
Figure 6-11 Delay expensive operations
7Relational Calculus (Part I) 7.1Introduction
We established earlier the fundamental role of relational algebra and calculus in relational databases (see 5.1). More specifically, relational calculus is the basis for the notion of relational completeness of a database language, ie. any language that can define any relation expressible in relational calculus is relationally complete.
Relational Algebra (see chapters 5 and 6) is one such language. Its approach is procedural, ie. it provides a number of basic operations on relations and successive applications of these operations must be properly sequenced to derive answers to database queries. The basic operators are in themselves quite simple and easy to understand. However, except for fairly simple queries, the construction of operation sequences can be quite complex. Furthermore, such constructions must also consider efficiency issues and strive to find optimal ones (see 6.5). A considerable amount of programming skill is therefore required to effectively use relational algebra.
Relational Calculus takes a different approach to the human–database interface. Rather than requiring users to specify how relations are to be manipulated, it only requires them to define what the desired result is. How the result is actually computed, ie. the operations used, their sequencing and optimisation, is left to the database management system to work out. As it doesn’t deal with procedures (ie. sequencing of operations), this approach is frequently termed non-procedural or declarative.
Relational Calculus is mainly based on the well-known Propositional Calculus, which is a method of calculating with sentences or declarations. Such sentences or declarations, also termed propositions, are ones for which a truth value (ie. “true” or “false”) may be assigned. These can be simple sentences, such as “the ball is red”, or they may be more complex involving one or more simple sentences, such as “the ball is red AND the playing field is green”. The truth value of complex sentences will of course depend on the truth values of their components. This is in fact what the calculus ‘calculates’, using rules for combining truth values of component sentences.
In Relational Calculus, the sentences we deal with are simpler and refer specifically to the relations and values in the database of interest. Simple sentences typically take the form of comparisons of values denoted by variables or constants, eg. X 3, X < Y, etc. More complex sentences are built using logical connectives And (‘&’) and Or (‘|’), eg. X > 7 & X < Y | X 5. Simple and complex sentences like these are examples of Well-Formed Formulae, which we will define fully later.
Regardless of their exact syntax, a formula is in principle a logical function with one or more free variables. For purposes of illustration, we will write such functions as in the following annotated example:
I n the above example, there is one free variable, X. The value of the function can be computed for specific instances of X. Thus,
F(15) (15 > 12 & 15 < 18) (true & true) true
F(10) (10 > 12 & 10 < 18) (false & true) false
Additionally, free variables are deemed to range over a set of permitted values, ie. only such values can instantiate them. We shall see the significance of this later, as applied to relations. But just to illustrate the concept for now, consider the following function over two free variables:
F(X,Y) =: X > Y & Y < 12
Suppose X ranges over {8, 15} and Y ranges over {7,14}. Then F(8, 14) and F(15, 7) are allowable instantiations of the function, with truth values false and true respectively, whereas F(1000,200) is not a valid instantiation. Such restrictions of values over which free variables range become significant when we interpret a formula as the simple query: “get the set of values of free variables for which the formula evaluates to true”. Thus, for the above formula, we need only construct the following table involving only the permitted values:
-
X
|
Y
|
F(X,Y)
|
8
|
7
|
true
|
8
|
14
|
false
|
15
|
7
|
true
|
15
|
14
|
false
|
The desired set of values can then be read from the rows where F(X,Y) evaluated to true, ie. the set {(8,7), (15,7)}.
Relational Calculus is an application of the above ideas to relations. We will develop these ideas in greater detail in the following sections.
Share with your friends: |