SQL
  Home arrow SQL arrow Page 3 - Focusing SQL Queries
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Dedicated Servers  
Actuate Whitepapers 
VeriSign Whitepapers 
IBM® developerWorks 
Sun Developer Network 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
SQL

Focusing SQL Queries
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 4
    2008-01-17

    Table of Contents:
  • Focusing SQL Queries
  • Lessons to be learned from the Batmobile trade
  • Querying Large Quantities of Data
  • The Proportions of Retrieved Data

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT

    Stay one step ahead of the competition. Evaluate and give feedback on some of the hottest web development tools on the market today. Make your opinion heard! Click Here

    Focusing SQL Queries - Querying Large Quantities of Data


    (Page 3 of 4 )

     

    It may sound obvious, but the sooner we get rid of unwanted data, the less we have to process at later stages of a query—and the more efficiently the query will run. An excellent application of this principle can be found with set operators, of which union is probably the most widely used. It is quite common to find in a moderately complex uniona number of tables appearing in several of the queries “glued” together with theunionoperator. One often sees theunionof fairly complex joins, with most of the joined tables occurring in bothselectstatements of theunion—for example, on both sides of theunion, something like the following:

      select ...
     
    from A,
           B,
           C,
           D,
           E1
      where (condition on E1)
        and (joins and other conditions)

      union
      select ...
      from A,
          
    B,
          
    C,
          
    D,
          
    E2
     
    where (condition on E2)
        and (joins and other conditions)

    This type of query is typical of the cut-and-paste school of programming. In many cases it may be more efficient to use aunionof those tables that are not common, complete with the screening conditions, and to then push thatunioninto an inline view and join the result, writing something similar to:

      select ...
      from A,
           B,
          
    C,
          
    D,
          
    (select ...
           
    from E1
           
    where (condition on E1)
           
    union
           
    select ...
           
    from E2
           
    where (condition on E2)) E
     
    where (joins and other conditions)

    Another classic example of conditions applied at the wrong place is a danger associated with filtering when a statement contains agroup byclause. You can filter on the columns that define the grouping, or the result of the aggregate (for instance when you want to check whether the result of acount()is smaller than a threshold) or both. SQL allows you to specify all such conditions inside thehaving clause that filters after thegroup by(in practice, a sort followed by an aggregation) has been completed. Any condition bearing on the result of an aggregate function must be inside thehaving clause, since the result of such a function is unknown before thegroup by. Any condition that is independent on the aggregate should go to thewhere clause and contribute to decrease the number of rows that we shall have to sort to perform thegroup by.

    Let’s return to our customers and orders example, admitting that the way we process orders is rather complicated. Before an order is considered complete, we have to go through several phases that are recorded in the tableorderstatus, of which the main columns areordid, the identifier of the order;status; andstatusdate, which is a timestamp. The primary key is compound, consisting ofordid, andstatusdate. Our requirement is to list, for all orders for which the status is not flagged as complete (assumed to be final), the identifier of the order, the customer name, the last known order status, and when this status was set. To that end, we might build the following query, filtering out completed orders and identifying the current status as the latest status assigned:

      select c.custname, o.ordid, os.status, os.statusdate
     
    from customers c,
          
    orders o,
          
    orderstatus os
     
    where o.ordid = os.ordid
       
    and not exists (select null
                       
    from orderstatus os2
                       
    where os2.status = 'COMPLETE'
                         
    and os2.ordid = o.ordid)
        and os.statusdate = (select max(statusdate)
                            
    from orderstatus os3
                            
    where os3.ordid = o.ordid)
        and o.custid = c.custid

    At first sight this query looks reasonable, but in fact it contains a number of deeply disturbing features. First, notice that we have two subqueries, and notice too that they are not nested as in the previous examples, but are only indirectly related to each other. Most worrying of all, both subqueries hit the very same table, already referenced at the outer level. What kind of filtering condition are we providing? Not a very precise one, as it only checks for the fact that orders are not yet complete.

    How can such a query be executed? An obvious approach is to scan theorderstable, for each row checking whether each order is or is not complete. (Note that we might have been happy to find this information in the orders table itself, but this is not the case.) Then, and only then, can we check the date of the most recent status, executing the subqueries in the order in which they are written.

    The unpleasant fact is that both subqueries are correlated. Since we have to scan theorderstable, it means that for every row fromorderswe shall have to check whether we encounter the status set toCOMPLETEfor that order. The subquery to check for that status will be fast to execute, but not so fast when repeated a large number of times. When there is noCOMPLETEstatus to be found, then a second subquery must be executed. What about trying to un-correlate queries?

    The easiest query to un-correlate happens to be the second one. In fact, we can write, at least with some SQL dialects:

      and (o.ordid, os.statusdate) = (select ordid, max(statusdate)
                                      from orderstatus
                                      group by ordid)

    The subquery that we have now will require a full scan oforderstatus; but that’s not necessarily bad, and we’ll discuss our reasoning in a moment.

    There is something quite awkward in the condition of the pair of columns on the left-hand side of the rewritten subquery condition. These columns come from different tables, and they need not do so. In fact, we want the order identifier to be the same inorders andorderstatus; will the optimizer understand the subtlety of this situation? That is rather uncertain. If the optimizer doesn’t understand, then it will be able to execute the subquery first, but will have to join the two other tables together before being able to exploit the result of the subquery. If the query were written slightly differently, the optimizer would have greater freedom to decide whether it actually wants to do what I’ve just described or exploit the result of the subquery and then joinorderstoorderstatus:

      and (os.ordid, os.statusdate) = (select ordid, max(statusdate)
                                      
    from orderstatus
                                      
    group by ordid)

    The reference on the left side to two columns from the same table removes the dependence of identification of the most recent status for the order on a preliminary join betweenorderstatusandorders. A very clever optimizer might have performed the modification for us, but it is wiser to take no risk and specify both columns from the same table to begin with. It is always much better to leave the optimizer with as much freedom as we can.

    You have seen previously that an uncorrelated subquery can become a join in an inline view without much effort. We can indeed rewrite the entire query to list pending orders as follows:

      select c.custname, o.ordid, os.status, os.statusdate
     
    from customers c,
          
    orders o,
          
    orderstatus os,
          
    (select ordid, max(statusdate) laststatusdate
           
    from orderstatus
           
    group by ordid) x
      where o.ordid = os.ordid
       
    and not exists (select null
                       
    from orderstatus os2
                       
    where os2.status = 'COMPLETE'
                         
    and os2.ordid = o.ordid)
        and os.statusdate = x.laststatusdate
        and os.ordid = x.ordid
        and o.custid = c.custid

    But then, ifCOMPLETE is indeed the final status, do we need the subquery to check the nonexistence of the last stage? The inline view helps us to identify which is the last status, whether it isCOMPLETEor anything else. We can apply a perfectly satisfactory filter by checking the latest known status:

      select c.custname, o.ordid, os.status, os.statusdate
      from customers c,
          
    orders o,
          
    orderstatus os,
          
    (select ordid, max(statusdate) laststatusdate
           
    from orderstatus
           
    group by ordid) x
     
    where o.ordid = os.ordid
       
    and os.statusdate = x.laststatusdate
       
    and os.ordid = x.ordid
       
    and os.status != 'COMPLETE'
       
    and o.custid = c.custid

    The duplicate reference toorderstatuscan be further avoided by using OLAP or analytical functions available with some SQL engines. But let’s pause here and consider how we have modified the query and, more importantly, the execution path. Basically, our natural path was initially to scan theorderstable, and then access through what may reasonably be expected to be an efficient index on the tableorderstatus. In the last version of our query, we will attack through a full scan oforderstatus, to perform agroup by. In terms of the number of rows,orderstatuswill necessarily be several times bigger thanorders. However, in terms of mere volume of data to scan, we can expect it to be smaller, possibly significantly smaller, depending on how much information is stored for each order.

    We cannot say with certainty which approach will be better, it depends on the data. Let me add that seeing a full scan on a table that is expected to grow is not a good idea (restricting the search to the last month’s, or last few months’ worth of data can help). But there are significant chances that this last version of our query will perform better than the first attempt with the subquery in thewhereclause.

    We cannot leave the subject of large data volumes without mentioning a slightly special case. When a query returns a very large amount of data, you have reasonable grounds for suspecting that it’s not an individual sitting at a terminal that executed the query. The likelihood is that such a query is part of a batch process. Even if there is a longish “preparatory phase,” nobody will complain so long as the whole process performs to a satisfactory standard. Do not, of course, forget that a phase, preparatory or not, requires resources—CPU, memory, and possibly temporary disk space. It helps to understand that the optimizer, when returning a lot of data, may choose a path which has nothing in common with the path it would adopt when returning few rows, even if the fundamental query is identical.

    Filter out unneeded data as early as possible.

    More SQL Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "The Art of SQL," published by O'Reilly. We...
     

    Buy this book now. This article is excerpted from chapter four of The Art of SQL, written by Stephane Faroult and Peter Robson (O'Reilly; ISBN: 0596008945). Check it out today at your favorite bookstore. Buy this book now.

    SQL ARTICLES

    - Focusing SQL Queries
    - Complex SQL Queries
    - A Close Look at the SQL Query
    - Generating Reports with SQL Date Handling
    - Creating SQL Reports Based on Date Criteria
    - SQL Date Handling and Data Trends
    - Date Handling
    - Introduction to SQL
    - Lies, Damn Lies, Statistics, and SQL







    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 3 hosted by Hostway