Let’s explore alternative ways to list our buyers of Batmobiles. In my view, as a general rule, distinct at the top level should be avoided whenever possible. The reason is that if we have overlooked a join condition, a distinct will hide the problem. Admittedly this is a greater risk when building queries with the older syntax, but nevertheless still a risk when using the ANSI/SQL92 syntax if tables are joined through several columns. It is usually much easier to spot duplicate rows than it is to identify incorrect data.
It’s easy to give a proof of the assertion that incorrect results may be difficult to spot: the two previous queries that usedistinctto return the names of the customers may actually return a wrong result. If we happen to have several customers named “Wayne,” we won’t get that information becausedistinctwill not only remove duplicates resulting from multiple orders by the same customer, but also remove duplicates resulting from homonyms. In fact, we should return both the unique customer id and the customer name to be certain that we have the full list of Batmobile buyers. We can only guess at how long it might take to identify such an issue in production.
How can we get rid ofdistinctthen? By acknowledging that we are looking for customers in Gotham that satisfy an existence test, namely a purchase order for a Batmobile in the past six months. Note that most, but not all, SQL dialects support the following syntax:
select c.custname from customers c where c.city = 'GOTHAM' and exists (select null from orders o, orderdetail od, articles a where a.artname = 'BATMOBILE' and a.artid = od.artid and od.ordid = o.ordid and o.custid = c.custid and o.ordered >= somefunc)
If we use an existence test such as this query uses, a name may appear more than once if it is common to several customers, but each individual customer will appear only once, irrespective of the number of orders they placed. You might think that my criticism of the ANSI SQL syntax was a little harsh, sincecustomers figure as prominently, if not more prominently than before. However, it now features as the source for the data we want the query to return. And another query, nested this time, appears as a major phase in the identification of the subset of customers.
The inner query in the preceding example is strongly linked to the outerselect. As you can see on line 11 (in bold), the inner query refers to the current row of the outer query. Thus, the inner query is what is called a correlated subquery. The snag with this type of subquery is that we cannot execute it before we know the current customer. Once again, we are assuming that the optimizer doesn’t rewrite the query. Therefore we must first find each customer and then check for each one whether the existence test is satisfied. Our query as a whole may perform excellently if we have very few customers in Gotham. It may be dreadful if Gotham is the place where most of our customers are located (a case in which a sophisticated optimizer might well try to execute the query in a different way).
We have still another way to write our query, which is as follows:
select custname from customers where city = 'GOTHAM' and custid in (select o.custid from orders o, orderdetail od, articles a where a.artname = 'BATMOBILE' and a.artid = od.artid and od.ordid = o.ordid and o.ordered >= somefunc)
In this case, the inner query no longer depends on the outer query: it has become an uncorrelated subquery. It needs to be executed only once. It should be obvious that we have now reverted the flow of execution. In the previous case, we had to search first for customers in the right location (e.g., where city is Gotham), and then check each order in turn. In this latest version of the query, the identifiers of customers who have ordered what we are looking for are obtained via a join that takes place in the inner query.
If you have a closer look, however, there are more subtle differences as well between the current and preceding examples. In the case of the correlated subquery, it is of paramount importance to have theorderstable indexed oncustid; in the second case, it no longer matters, since then the index (if any) that will be used is the index associated with the primary key ofcustomers.
You might notice that the most recent version of the query performs an implicitdistinct. Indeed, the subquery, because of its join, might return many rows for a single customer. That duplication doesn’t matter, because theincondition checks only to see whether a value is in the list returned by the subquery, andindoesn’t care whether a given value is in that list one time or a hundred times. Perhaps though, for the sake of consistency we should apply the same rules to the subquery that we have applied to the query as a whole, namely to acknowledge that we have an existence test within the subquery as well:
select custname from customers where city = 'GOTHAM' and custid in (select o.custid from orders o where o.ordered >= somefunc and exists (select null from orderdetail od, articles a where a.artname = 'BATMOBILE' and a.artid = od.artid and od.ordid = o.ordid))
select custname from customers where city = 'GOTHAM' and custid in (select custid from orders where ordered >= somefunc and ordid in (select od.ordid from orderdetail od, articles a where a.artname = 'BATMOBILE' and a.artid = od.artid)
Irrespective of the fact that our nesting is getting deeper and becoming less legible, choosing which query is the best between theexistsand theinfollows the very same rule inside the subquery as before: the choice depends on the effectiveness of the condition on the date versus the condition on the article. Unless business has been very, very slow for the past six months, one might reasonably expect that the most efficient condition on which to filter the data will be the one on the article name. Therefore, in the particular case of the subquery,inis better thanexists because it will be faster to find all the order lines that refer to a Batmobile and then to check whether the sale occurred in the last six months rather than the other way round. This approach will be faster assuming that the tableorderdetailis indexed onartid; otherwise, our bright, tactical move will fail dismally.
It may be a good idea to check in against exists whenever an existence test is applied to a significant number of rows.
Most SQL dialects allow you to rewrite uncorrelated subqueries as inline views in thefromclause. However, you must always remember that aninperforms an implicit removal of duplicate values, which must become explicit when the subquery is moved to become an in-line view in thefromclause. For example:
select custname from customers where city = 'GOTHAM' and custid in (select o.custid from orders o, (select distinct od.ordid from orderdetail od, articles a where a.artname = 'BATMOBILE' and a.artid = od.artid) x where o.ordered >= somefunc and x.ordid = o.ordid)
The different ways you have to write functionally equivalent queries (and variants other than those given in this section are possible) are comparable to words that are synonyms. In written and spoken language, synonyms have roughly the same meaning, but each one introduces a subtle difference that makes one particular word more suitable to a particular situation or expression (and in some cases another synonym is totally inappropriate). In the same way, both data and implementation details may dictate the choice of one query variant over others.