Home arrow SQL arrow Focusing SQL Queries
SQL

Focusing SQL Queries


In this conclusion to a three-part series that examines the SQL query, you will learn some alternative ways of phrasing a query, and how to query large quantities of data. This article is excerpted from chapter four of The Art of SQL, written by Stephane Faroult and Peter Robson (O'Reilly; ISBN: 0596008945). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
Rating: 4 stars4 stars4 stars4 stars4 stars / 26
January 17, 2008
TABLE OF CONTENTS:
  1. · Focusing SQL Queries
  2. · Lessons to be learned from the Batmobile trade
  3. · Querying Large Quantities of Data
  4. · The Proportions of Retrieved Data

print this article
SEARCH DEVARTICLES

Focusing SQL Queries
(Page 1 of 4 )

More Batmobile purchases

Let’s explore alternative ways to list our buyers of Batmobiles. In my view, as a general rule, distinct at the top level should be avoided whenever possible. The reason is that if we have overlooked a join condition, a distinct will hide the problem. Admittedly this is a greater risk when building queries with the older syntax, but nevertheless still a risk when using the ANSI/SQL92 syntax if tables are joined through several columns. It is usually much easier to spot duplicate rows than it is to identify incorrect data.

It’s easy to give a proof of the assertion that incorrect results may be difficult to spot: the two previous queries that usedistinctto return the names of the customers may actually return a wrong result. If we happen to have several customers named “Wayne,” we won’t get that information becausedistinctwill not only remove duplicates resulting from multiple orders by the same customer, but also remove duplicates resulting from homonyms. In fact, we should return both the unique customer id and the customer name to be certain that we have the full list of Batmobile buyers. We can only guess at how long it might take to identify such an issue in production.

How can we get rid ofdistinctthen? By acknowledging that we are looking for customers in Gotham that satisfy an existence test, namely a purchase order for a Batmobile in the past six months. Note that most, but not all, SQL dialects support the following syntax:

  select c.custname
  from customers c
  where c.city = 'GOTHAM'
   
and exists (select null
              
from orders o,
                   
orderdetail od,
                   
articles a
              
where a.artname = 'BATMOBILE'
                
and a.artid = od.artid
                
and od.ordid = o.ordid
                
and o.custid = c.custid
                 and o.ordered >= somefunc)

If we use an existence test such as this query uses, a name may appear more than once if it is common to several customers, but each individual customer will appear only once, irrespective of the number of orders they placed. You might think that my criticism of the ANSI SQL syntax was a little harsh, sincecustomers figure as prominently, if not more prominently than before. However, it now features as the source for the data we want the query to return. And another query, nested this time, appears as a major phase in the identification of the subset of customers.

The inner query in the preceding example is strongly linked to the outerselect. As you can see on line 11 (in bold), the inner query refers to the current row of the outer query. Thus, the inner query is what is called a correlated subquery. The snag with this type of subquery is that we cannot execute it before we know the current customer. Once again, we are assuming that the optimizer doesn’t rewrite the query. Therefore we must first find each customer and then check for each one whether the existence test is satisfied. Our query as a whole may perform excellently if we have very few customers in Gotham. It may be dreadful if Gotham is the place where most of our customers are located (a case in which a sophisticated optimizer might well try to execute the query in a different way).

We have still another way to write our query, which is as follows:

  select custname
  from customers
  where city = 'GOTHAM'
    and custid in
              
(select o.custid
              
from orders o,
                   
orderdetail od,
                   
articles a
              
where a.artname = 'BATMOBILE'
                
and a.artid = od.artid
                
and od.ordid = o.ordid
                
and o.ordered >= somefunc)

In this case, the inner query no longer depends on the outer query: it has become an uncorrelated subquery. It needs to be executed only once. It should be obvious that we have now reverted the flow of execution. In the previous case, we had to search first for customers in the right location (e.g., where city is Gotham), and then check each order in turn. In this latest version of the query, the identifiers of customers who have ordered what we are looking for are obtained via a join that takes place in the inner query.

If you have a closer look, however, there are more subtle differences as well between the current and preceding examples. In the case of the correlated subquery, it is of paramount importance to have theorderstable indexed oncustid; in the second case, it no longer matters, since then the index (if any) that will be used is the index associated with the primary key ofcustomers.

You might notice that the most recent version of the query performs an implicitdistinct. Indeed, the subquery, because of its join, might return many rows for a single customer. That duplication doesn’t matter, because theincondition checks only to see whether a value is in the list returned by the subquery, andindoesn’t care whether a given value is in that list one time or a hundred times. Perhaps though, for the sake of consistency we should apply the same rules to the subquery that we have applied to the query as a whole, namely to acknowledge that we have an existence test within the subquery as well:

  select custname
  from customers
  where city = 'GOTHAM'
   
and custid in
              
(select o.custid
                from orders o
                where o.ordered >= somefunc

                  and exists (select null
                              from orderdetail od,
                                  articles a
                             
where a.artname = 'BATMOBILE'
                                and a.artid = od.artid
                                and od.ordid = o.ordid))

or:

  select custname
  from customers
  where city = 'GOTHAM'
   
and custid in
              
(select custid
                from orders
                where ordered >= somefunc

                  and ordid in (select od.ordid
                                from orderdetail od,
                                  articles a
                                where a.artname = 'BATMOBILE'
                                  and a.artid = od.artid)

Irrespective of the fact that our nesting is getting deeper and becoming less legible, choosing which query is the best between theexistsand theinfollows the very same rule inside the subquery as before: the choice depends on the effectiveness of the condition on the date versus the condition on the article. Unless business has been very, very slow for the past six months, one might reasonably expect that the most efficient condition on which to filter the data will be the one on the article name. Therefore, in the particular case of the subquery,inis better thanexists because it will be faster to find all the order lines that refer to a Batmobile and then to check whether the sale occurred in the last six months rather than the other way round. This approach will be faster assuming that the tableorderdetailis indexed onartid; otherwise, our bright, tactical move will fail dismally.

NOTE

It may be a good idea to check in against exists whenever an existence test is applied to a significant number of rows.

Most SQL dialects allow you to rewrite uncorrelated subqueries as inline views in thefromclause. However, you must always remember that aninperforms an implicit removal of duplicate values, which must become explicit when the subquery is moved to become an in-line view in thefromclause. For example:

  select custname
  from customers
  where city = 'GOTHAM'
    and custid in
              
(select o.custid
               
from orders o,
                    
(select distinct od.ordid
                     
from orderdetail od,
                          
articles a
                     
where a.artname = 'BATMOBILE'
                       
and a.artid = od.artid) x
               
where o.ordered >= somefunc
                 
and x.ordid = o.ordid)

The different ways you have to write functionally equivalent queries (and variants other than those given in this section are possible) are comparable to words that are synonyms. In written and spoken language, synonyms have roughly the same meaning, but each one introduces a subtle difference that makes one particular word more suitable to a particular situation or expression (and in some cases another synonym is totally inappropriate). In the same way, both data and implementation details may dictate the choice of one query variant over others.


blog comments powered by Disqus
SQL ARTICLES

- Focusing SQL Queries
- Complex SQL Queries
- A Close Look at the SQL Query
- Generating Reports with SQL Date Handling
- Creating SQL Reports Based on Date Criteria
- SQL Date Handling and Data Trends
- Date Handling
- Introduction to SQL
- Lies, Damn Lies, Statistics, and SQL

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 

Developer Shed Affiliates

 




© 2003-2017 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials