A Close Look at the SQL Query - SQL and the Optimizer (Page 2 of 4 )
An SQL engine that receives a query to process will have to use the optimizer to find out how to execute that query in the most efficient way possible. Here the relational theory strikes again, because that theory informs the optimizer of transformations that are valid equivalents of the semantically correct query initially provided by the developer—even if that original query was clumsily written.
Optimization is when the physical implementation of data comes into play. Depending on the existence of indexes and their usability in relation to a query, some transformations may result in much faster execution than other semantically equivalent transformations. Various storage models that I introduce in Chapter 5 may also make one particular way to execute a query irresistibly attractive. The optimizer examines the disposition of the indexes that are available, the physical layout of data, how much memory is available, and how many processors are available to be applied to the task of executing the query. The optimizer will also take into account information concerning the volume of the various tables and indexes that may be involved, directly or indirectly, through views used by the query. By weighing the alternatives that theory says are valid equivalents against the possibilities allowed by the implementation of the database, the optimizer will generate what is, hopefully, the best execution plan for the query.
However, the key point to remember is that, although the optimizer may not always be totally weaponless in the non-relational layer of an SQL query, it is mainly in the relational core that it will be able to deploy its full power—precisely because of the mathematical underpinnings of the relational theory. The transformation from one SQL query to another raises an important point: it reminds us that SQL is supposed to be a declarative language. In other words, one should use SQL to express what is required, rather than how that requirement is to be met. Going from what to how, should, in theory, be the work of the optimizer.
You saw in Chapters 1 and 2 that SQL queries are only some of the variables in the equation; but even at the tactical query level, a poorly written query may prevent the optimizer from working efficiently. Remember, the mathematical basis of the relational theory provides an unassailable logic to the proceedings. Therefore, part of the art of SQL is to minimize the thickness, so to speak, of the non-relational layer—outside this layer, there is not much that the optimizer can safely do that guarantees returning exactly the same rows as the original query.
Another part of the art of SQL is that when performing non-relational operations—loosely defined as operations for which the whole (at least at this stage) resulting dataset is known—we must be extremely careful to operate on only the data that is strictly required to answer the original question, and nothing more. Somehow, a finite data set, as opposed to the current row, has to be stored somewhere, and storing anything in temporary storage (memory or disk) requires significant overhead due to byte-pushing. This overhead may dramatically increase as the result set data volumes themselves increase, particularly if main memory becomes unavailable. A shortage of main memory would initiate the high-resource activity of swapping to disk, with all its attendant overheads. Moreover, always remember that indexes refer to disk addresses, not temporary storage—as soon as the data is in temporary storage, we must wave farewell to most fast access methods (with the possible exception of hashing).
Some SQL dialects mislead users into believing that they are still in the relational world when they have long since left it. Take as a simple example the query “Who are the five top earners among employees who are not executives?”—a reasonable real-life question, although one that includes a distinctly non-relational twist. Identifying employees who are not executives is the relational part of the query, from which we obtain a finite set of employees that we can order. Several SQL dialects allow one to limit the number of rows returned by adding a special clause to the select statement. It is then fairly obvious that both the ordering and the limitation criteria are outside the relational layer. However, other dialects, the Oracle version figuring prominently here, use other mechanisms. What Oracle has is a dummy column namedrownumthat applies a sequential numbering to the rows in the order in which they are returned—which means the numbering is applied during the relational phase. If we write something such as:
select empname, salary from employees where status != 'EXECUTIVE' and rownum <= 5 order by salary desc
we get an incorrect result, at least in the sense that we are not getting the top five most highly paid nonexecutives, as the query might suggest at first glance. Instead, we get back the first five nonexecutives found—they could be the five lowest paid!—ordered in descending order of salary. (This query illustrates a well-known trap among Oracle practitioners, who have all been burnt at least once.)
Let’s just be very clear about what is happening with the preceding query. The relational component of the query simply retrieves the first five rows (attributesempnameandsalaryonly) from the tableemployeeswhere the employee is not an executive in a totally unpredictable order. Remember that relational theory tells us that a relation (and therefore the table that represents it) is not defined in any way by the order in which tuples (and therefore the rows in that table) are either stored or retrieved. As a consequence the nonexecutive employee with the highest salary may or may not be included in this result set—and there is no way we will ever know whether this result set actually meets our search criteria correctly.
What we really want is to get all nonexecutives, order them by decreasing salary, and only then get the top five in the set. We can achieve this objective as follows:
select * from (select empname, salary from employees where status != 'EXECUTIVE' order by salary desc) where rownum <= 5
So, how is our query layered in this case? Many would be tempted to say that by applying a filtering condition to an ordered result, we end up with something looking more or less like Figure 4-3.
Figure 4-3. A misleding view of what the "top five nonexecutives" query looks like
The truth, however, is more like Figure 4-4.
Using constructs that look relational doesn’t take us back to the relational world, because to be in the relational world we must apply relational operators to relations. Our subquery uses anorder byto sort the results. Once we’ve imposed ordering, we no longer have, strictly speaking, a relation (a relation is a set, and a set has no order). We end up with an outerselectthat looks relational on the surface but is applied to the output of an inline view in which a significant component (theorder byclause) is not a relational process.
Figure 4-4. What the "top five nonexecutives" query is really like
My example of the top five nonexecutives is, of course, a simple example, but do understand that once we have left the relational sphere in the execution of a query, we can no longer return to it. The best we can possibly do is to use the output of such a query to feed into the relational phase of an outer query. For instance, “in which departments are our five top nonexecutive earners working?” What is extremely important to understand, though, is that at this stage no matter how clever the optimizer is, it will be absolutely unable to combine the queries, and will more or less have to execute them in a given sequence. Further, any resulting set from an intermediate query is likely to be held in temporary storage, whether in memory or on disk, where the choice of access methods may be reduced. Once outside the pure relational layer, the way we write a query is of paramount importance for performance because it will inevitably impose onto the query some execution path from which the SQL engine will not be able to stray.
To summarize, we can say that the safest approach we can adopt is to try to do as much of the job as possible inside the relational layer, where the optimizer can operate to maximum efficiency. When the situation is such that a given SQL task is no longer a purely relational problem, then we must be particularly careful about the construct, or the writing of the query itself. Understanding that SQL has, like Dr. Jekyll, a double nature is the key to mastering the language. If you see SQL as a single-edged sword, then you are condemned to remain in the world of tips and tricks for dummies, smarties, and mere mortals, possibly useful for impressing the opposite sex—although in my experience it doesn’t work much—but an approach that will never provide you with a deep understanding of how to cope with a difficult SQL problem.
The optimizer rewards those who do the most work in the relational layer.