Focusing SQL Queries - Querying Large Quantities of Data
It may sound obvious, but the sooner we get rid of unwanted data, the less we have to process at later stages of a query—and the more efficiently the query will run. An excellent application of this principle can be found with set operators, of which union is probably the most widely used. It is quite common to find, in a moderately complex union, a number of tables appearing in several of the queries "glued" together with the union operator. One often sees the union of fairly complex joins, with most of the joined tables occurring in both select statements of the union—for example, on both sides of the union, something like the following:
select ...
from A, B, C, D, E1
where (condition on E1)
  and (joins and other conditions)
union
select ...
from A, B, C, D, E2
where (condition on E2)
  and (joins and other conditions)
This type of query is typical of the cut-and-paste school of programming. In many cases it may be more efficient to use a union of those tables that are not common, complete with the screening conditions, and then to push that union into an inline view and join the result, writing something similar to:
select ...
from A, B, C, D,
     (select ...
      from E1
      where (condition on E1)
      union
      select ...
      from E2
      where (condition on E2)) E
where (joins and other conditions)
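To make the pattern more tangible, here is a minimal sketch with purely hypothetical tables (orders_2023, orders_2024, customers, and products, with invented columns): the union of the two tables that differ is pushed into an inline view, together with its screening conditions, so that the shared tables are joined only once.

-- Hypothetical schema, for illustration only: the two archive tables
-- are combined first, and the common joins are performed on the result.
select c.custname, p.prodname, o.qty
from customers c,
     products p,
     (select custid, prodid, qty
      from orders_2023
      where qty > 100
      union
      select custid, prodid, qty
      from orders_2024
      where qty > 100) o
where o.custid = c.custid
  and o.prodid = p.prodid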
Another classic example of conditions applied in the wrong place is the danger associated with filtering when a statement contains a group by clause. You can filter on the columns that define the grouping, on the result of the aggregate (for instance, when you want to check whether the result of a count() is smaller than a threshold), or on both. SQL allows you to specify all such conditions inside the having clause, which filters after the group by (in practice, a sort followed by an aggregation) has been completed. Any condition bearing on the result of an aggregate function must indeed be inside the having clause, since the result of such a function is unknown before the group by. But any condition that is independent of the aggregate should go into the where clause, where it contributes to decreasing the number of rows that we shall have to sort to perform the group by.
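As a quick illustration, borrowing the orderstatus table described in the next example (the identifier range below is invented for the sketch): suppose we want to count, per order, how many status rows exist, but only for a given range of order identifiers, keeping orders that went through more than three statuses. The condition on ordid, a grouping column, could legally be written in the having clause, but it belongs in the where clause; only the condition on count(*) has to wait until after the aggregation:

-- The condition on the grouping column (ordid) is legal in having,
-- but placed in where it eliminates rows before the sort/aggregation.
select ordid, count(*) status_count
from orderstatus
where ordid between 1000 and 1999
group by ordid
having count(*) > 3

Moved into the having clause, the condition on ordid would return the same result, but every row of orderstatus would first have to be sorted and aggregated, only to be thrown away afterwards.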
Let’s return to our customers and orders example, admitting that the way we process orders is rather complicated. Before an order is considered complete, we have to go through several phases that are recorded in the table orderstatus, of which the main columns are ordid, the identifier of the order; status; and statusdate, which is a timestamp. The primary key is compound, consisting of ordid and statusdate. Our requirement is to list, for all orders for which the status is not flagged as complete (assumed to be final), the identifier of the order, the customer name, the last known order status, and when this status was set. To that end, we might build the following query, filtering out completed orders and identifying the current status as the latest status assigned:
select c.custname, o.ordid, os.status, os.statusdate
from customers c,
     orders o,
     orderstatus os
where o.ordid = os.ordid
  and not exists (select null
                  from orderstatus os2
                  where os2.status = 'COMPLETE'
                    and os2.ordid = o.ordid)
  and os.statusdate = (select max(statusdate)
                       from orderstatus os3
                       where os3.ordid = o.ordid)
  and o.custid = c.custid
At first sight this query looks reasonable, but in fact it contains a number of deeply disturbing features. First, notice that we have two subqueries, and notice too that they are not nested as in the previous examples, but are only indirectly related to each other. Most worrying of all, both subqueries hit the very same table, already referenced at the outer level. What kind of filtering condition are we providing? Not a very precise one, as it only checks for the fact that orders are not yet complete.
How can such a query be executed? An obvious approach is to scan the orders table, checking for each row whether the order is or is not complete. (Note that we might have been happy to find this information in the orders table itself, but this is not the case.) Then, and only then, can we check the date of the most recent status, executing the subqueries in the order in which they are written.
The unpleasant fact is that both subqueries are correlated. Since we have to scan the orders table, for every row from orders we shall have to check whether we encounter the status set to COMPLETE for that order. The subquery that checks for that status will be fast to execute, but not so fast when repeated a large number of times. And when there is no COMPLETE status to be found, a second subquery must be executed as well. What about trying to un-correlate the queries?
The easiest query to un-correlate happens to be the second one. In fact, we can write, at least with some SQL dialects:
and (o.ordid, os.statusdate) in (select ordid, max(statusdate)
                                 from orderstatus
                                 group by ordid)
The subquery that we have now will require a full scan of orderstatus; but that’s not necessarily bad, and we’ll discuss our reasoning in a moment.
There is something quite awkward about the pair of columns on the left-hand side of the rewritten condition. These columns come from different tables, and they need not do so. In fact, we want the order identifier to be the same in orders and orderstatus; will the optimizer understand the subtlety of this situation? That is rather uncertain. If the optimizer doesn’t understand, it will still be able to execute the subquery first, but it will have to join the two other tables together before being able to exploit the result of the subquery. If the query were written slightly differently, the optimizer would have greater freedom to decide whether it actually wants to do what I’ve just described, or to exploit the result of the subquery and then join orders to orderstatus:
and (os.ordid, os.statusdate) in (select ordid, max(statusdate)
                                  from orderstatus
                                  group by ordid)
With both columns on the left side coming from the same table, identifying the most recent status for each order no longer depends on a preliminary join between orderstatus and orders. A very clever optimizer might have performed the modification for us, but it is wiser to take no risk and to specify both columns from the same table to begin with. It is always much better to leave the optimizer with as much freedom as we can.
You have seen previously that an uncorrelated subquery can become a join in an inline view without much effort. We can indeed rewrite the entire query to list pending orders as follows:
select c.custname, o.ordid, os.status, os.statusdate
from customers c,
     orders o,
     orderstatus os,
     (select ordid, max(statusdate) laststatusdate
      from orderstatus
      group by ordid) x
where o.ordid = os.ordid
  and not exists (select null
                  from orderstatus os2
                  where os2.status = 'COMPLETE'
                    and os2.ordid = o.ordid)
  and os.statusdate = x.laststatusdate
  and os.ordid = x.ordid
  and o.custid = c.custid
But then, if COMPLETE is indeed the final status, do we still need the subquery that checks for the absence of that last stage? The inline view helps us identify the last status, whether it is COMPLETE or anything else. We can apply a perfectly satisfactory filter simply by checking the latest known status:
select c.custname, o.ordid, os.status, os.statusdate
from customers c,
     orders o,
     orderstatus os,
     (select ordid, max(statusdate) laststatusdate
      from orderstatus
      group by ordid) x
where o.ordid = os.ordid
  and os.statusdate = x.laststatusdate
  and os.ordid = x.ordid
  and os.status != 'COMPLETE'
  and o.custid = c.custid
The duplicate reference to orderstatus can be further avoided by using the OLAP or analytical functions available with some SQL engines. But let’s pause here and consider how we have modified the query and, more importantly, the execution path. Basically, our natural path was initially to scan the orders table, and then to access the table orderstatus through what may reasonably be expected to be an efficient index. In the last version of our query, we will attack through a full scan of orderstatus, to perform a group by. In terms of the number of rows, orderstatus will necessarily be several times bigger than orders. However, in terms of mere volume of data to scan, we can expect it to be smaller, possibly significantly smaller, depending on how much information is stored for each order.
We cannot say with certainty which approach will be better; it depends on the data. Let me add that relying on a full scan of a table that is expected to keep growing is not a good idea (restricting the search to the last month’s, or the last few months’ worth of data can help). But there is a significant chance that this last version of our query will perform better than the first attempt with the subqueries in the where clause.
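As for the earlier remark about OLAP or analytical functions: in dialects that support window functions, the duplicate reference to orderstatus can indeed be removed with something like the following sketch, which ranks the statuses of each order by date in a single pass over the table (row_number() is one possible choice; the exact syntax and its relative efficiency vary between engines):

-- orderstatus is read only once; the most recent status per order
-- is the row ranked first when statuses are sorted by descending date.
select c.custname, o.ordid, x.status, x.statusdate
from customers c,
     orders o,
     (select ordid, status, statusdate,
             row_number() over (partition by ordid
                                order by statusdate desc) rn
      from orderstatus) x
where o.ordid = x.ordid
  and x.rn = 1
  and x.status != 'COMPLETE'
  and o.custid = c.custid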
We cannot leave the subject of large data volumes without mentioning a slightly special case. When a query returns a very large amount of data, you have reasonable grounds for suspecting that it’s not an individual sitting at a terminal that executed the query. The likelihood is that such a query is part of a batch process. Even if there is a longish “preparatory phase,” nobody will complain so long as the whole process performs to a satisfactory standard. Do not, of course, forget that a phase, preparatory or not, requires resources—CPU, memory, and possibly temporary disk space. It helps to understand that the optimizer, when returning a lot of data, may choose a path which has nothing in common with the path it would adopt when returning few rows, even if the fundamental query is identical.