
Focusing SQL Queries


In this conclusion to a three-part series that examines the SQL query, you will learn some alternative ways of phrasing a query, and how to query large quantities of data. This article is excerpted from chapter four of The Art of SQL, written by Stephane Faroult and Peter Robson (O'Reilly; ISBN: 0596008945). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
January 17, 2008

Focusing SQL Queries - Querying Large Quantities of Data

 

It may sound obvious, but the sooner we get rid of unwanted data, the less we have to process at later stages of a query—and the more efficiently the query will run. An excellent application of this principle can be found with set operators, of which union is probably the most widely used. It is quite common, in a moderately complex union, to find a number of tables appearing in several of the queries “glued” together with the union operator. One often sees the union of fairly complex joins, with most of the joined tables occurring in both select statements of the union—for example, on both sides of the union, something like the following:

  select ...
  from A,
       B,
       C,
       D,
       E1
  where (condition on E1)
    and (joins and other conditions)
  union
  select ...
  from A,
       B,
       C,
       D,
       E2
  where (condition on E2)
    and (joins and other conditions)

This type of query is typical of the cut-and-paste school of programming. In many cases it may be more efficient to use a union of those tables that are not common, complete with the screening conditions, and then to push that union into an inline view and join the result, writing something similar to:

  select ...
  from A,
       B,
       C,
       D,
       (select ...
        from E1
        where (condition on E1)
        union
        select ...
        from E2
        where (condition on E2)) E
  where (joins and other conditions)

Another classic example of conditions applied in the wrong place is a danger associated with filtering when a statement contains a group by clause. You can filter on the columns that define the grouping, on the result of the aggregate (for instance, when you want to check whether the result of a count() is smaller than a threshold), or on both. SQL allows you to specify all such conditions inside the having clause, which filters after the group by (in practice, a sort followed by an aggregation) has been completed. Any condition bearing on the result of an aggregate function must be inside the having clause, since the result of such a function is unknown before the group by. But any condition that is independent of the aggregate should go into the where clause, where it contributes to decreasing the number of rows that we shall have to sort to perform the group by.
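
To illustrate the split, here is a minimal sketch using the orders table from the running example; the customer identifier 0 and the threshold of 5 are arbitrary values chosen purely for illustration:

  select custid, count(*) order_count
  from orders
  where custid <> 0        -- independent of the aggregate: filtered before the sort
  group by custid
  having count(*) < 5      -- depends on count(): can only be checked after the group by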

Let’s return to our customers and orders example, admitting that the way we process orders is rather complicated. Before an order is considered complete, we have to go through several phases that are recorded in the table orderstatus, of which the main columns are ordid, the identifier of the order; status; and statusdate, which is a timestamp. The primary key is compound, consisting of ordid and statusdate. Our requirement is to list, for all orders for which the status is not flagged as complete (assumed to be final), the identifier of the order, the customer name, the last known order status, and when this status was set. To that end, we might build the following query, filtering out completed orders and identifying the current status as the latest status assigned:

  select c.custname, o.ordid, os.status, os.statusdate
  from customers c,
       orders o,
       orderstatus os
  where o.ordid = os.ordid
    and not exists (select null
                    from orderstatus os2
                    where os2.status = 'COMPLETE'
                      and os2.ordid = o.ordid)
    and os.statusdate = (select max(statusdate)
                         from orderstatus os3
                         where os3.ordid = o.ordid)
    and o.custid = c.custid

At first sight this query looks reasonable, but in fact it contains a number of deeply disturbing features. First, notice that we have two subqueries, and notice too that they are not nested as in the previous examples, but are only indirectly related to each other. Most worrying of all, both subqueries hit the very same table, already referenced at the outer level. What kind of filtering condition are we providing? Not a very precise one, as it only checks for the fact that orders are not yet complete.

How can such a query be executed? An obvious approach is to scan the orders table, checking for each row whether the order is or is not complete. (Note that we might have been happy to find this information in the orders table itself, but this is not the case.) Then, and only then, can we check the date of the most recent status, executing the subqueries in the order in which they are written.

The unpleasant fact is that both subqueries are correlated. Since we have to scan the orders table, for every row from orders we shall have to check whether we encounter the status set to COMPLETE for that order. The subquery that checks for that status will be fast to execute, but not so fast when repeated a large number of times. When there is no COMPLETE status to be found, a second subquery must then be executed. What about trying to un-correlate the queries?

The easiest query to un-correlate happens to be the second one. In fact, we can write, at least with some SQL dialects:

  and (o.ordid, os.statusdate) = (select ordid, max(statusdate)
                                  from orderstatus
                                  group by ordid)

The subquery that we have now will require a full scan of orderstatus; but that’s not necessarily bad, and we’ll discuss our reasoning in a moment.

There is something quite awkward about the pair of columns on the left-hand side of the rewritten subquery condition. These columns come from different tables, and they need not do so. In fact, we want the order identifier to be the same in orders and orderstatus; will the optimizer understand the subtlety of this situation? That is rather uncertain. If the optimizer doesn’t understand it, it will be able to execute the subquery first, but it will have to join the two other tables together before being able to exploit the result of the subquery. If the query were written slightly differently, the optimizer would have greater freedom to decide whether it actually wants to do what I’ve just described or to exploit the result of the subquery and then join orders to orderstatus:

  and (os.ordid, os.statusdate) = (select ordid, max(statusdate)
                                   from orderstatus
                                   group by ordid)

The reference on the left side to two columns from the same table means that identifying the most recent status for the order no longer depends on a preliminary join between orderstatus and orders. A very clever optimizer might have performed the modification for us, but it is wiser to take no risk and specify both columns from the same table to begin with. It is always much better to leave the optimizer with as much freedom as we can.

You have seen previously that an uncorrelated subquery can become a join in an inline view without much effort. We can indeed rewrite the entire query to list pending orders as follows:

  select c.custname, o.ordid, os.status, os.statusdate
  from customers c,
       orders o,
       orderstatus os,
       (select ordid, max(statusdate) laststatusdate
        from orderstatus
        group by ordid) x
  where o.ordid = os.ordid
    and not exists (select null
                    from orderstatus os2
                    where os2.status = 'COMPLETE'
                      and os2.ordid = o.ordid)
    and os.statusdate = x.laststatusdate
    and os.ordid = x.ordid
    and o.custid = c.custid

But then, if COMPLETE is indeed the final status, do we need the subquery that checks for the nonexistence of this final stage? The inline view helps us to identify the last status, whether it is COMPLETE or anything else. We can apply a perfectly satisfactory filter by checking the latest known status:

  select c.custname, o.ordid, os.status, os.statusdate
  from customers c,
       orders o,
       orderstatus os,
       (select ordid, max(statusdate) laststatusdate
        from orderstatus
        group by ordid) x
  where o.ordid = os.ordid
    and os.statusdate = x.laststatusdate
    and os.ordid = x.ordid
    and os.status != 'COMPLETE'
    and o.custid = c.custid

The duplicate reference to orderstatus can be further avoided by using OLAP or analytical functions available with some SQL engines. But let’s pause here and consider how we have modified the query and, more importantly, the execution path. Basically, our natural path was initially to scan the orders table, and then access through what may reasonably be expected to be an efficient index on the table orderstatus. In the last version of our query, we will attack through a full scan of orderstatus, to perform a group by. In terms of the number of rows, orderstatus will necessarily be several times bigger than orders. However, in terms of mere volume of data to scan, we can expect it to be smaller, possibly significantly smaller, depending on how much information is stored for each order.
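
As a sketch of that analytical-function variant, assuming an engine that supports window functions such as row_number() (this query is an illustration, not the book’s own code), the latest status of each order can be picked out in a single pass over orderstatus:

  select custname, ordid, status, statusdate
  from (select c.custname,
               o.ordid,
               os.status,
               os.statusdate,
               -- rank the statuses of each order, most recent first
               row_number() over (partition by os.ordid
                                  order by os.statusdate desc) rn
        from customers c,
             orders o,
             orderstatus os
        where o.ordid = os.ordid
          and o.custid = c.custid) latest_status
  where rn = 1
    and status != 'COMPLETE'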

We cannot say with certainty which approach will be better; it depends on the data. Let me add that seeing a full scan on a table that is expected to grow is not a good idea (restricting the search to the last month’s, or last few months’ worth of data can help). But there is a significant chance that this last version of our query will perform better than the first attempt, with its subqueries in the where clause.
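
As a hedged illustration of that restriction, the aggregate inline view could be limited to recent history; the three-month window and the interval arithmetic below are assumptions that would need adapting to the actual DBMS and business rules, and any pending order whose latest status is older than the window would silently drop out of the result:

  select c.custname, o.ordid, os.status, os.statusdate
  from customers c,
       orders o,
       orderstatus os,
       (select ordid, max(statusdate) laststatusdate
        from orderstatus
        -- assumed three-month window; date arithmetic varies by SQL dialect
        where statusdate >= current_date - interval '3' month
        group by ordid) x
  where o.ordid = os.ordid
    and os.statusdate = x.laststatusdate
    and os.ordid = x.ordid
    and os.status != 'COMPLETE'
    and o.custid = c.custid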

We cannot leave the subject of large data volumes without mentioning a slightly special case. When a query returns a very large amount of data, you have reasonable grounds for suspecting that it’s not an individual sitting at a terminal that executed the query. The likelihood is that such a query is part of a batch process. Even if there is a longish “preparatory phase,” nobody will complain so long as the whole process performs to a satisfactory standard. Do not, of course, forget that a phase, preparatory or not, requires resources—CPU, memory, and possibly temporary disk space. It helps to understand that the optimizer, when returning a lot of data, may choose a path which has nothing in common with the path it would adopt when returning few rows, even if the fundamental query is identical.

Filter out unneeded data as early as possible.

