Complex SQL Queries - Buyers of Batmobiles (Page 4 of 4 )
Assume that we have four tables, namely customers,orders,orderdetail, and a table ofarticles, as shown in Figure 4-5. Please note that in the figure the sizes of the boxes representing each table are more or less proportional to the volume of data in each table, not simply to the number of columns. Primary key columns are underlined.
Figure 4-5. A classical order schema
Let’s now suppose that our SQL problem is to find the names of all the customers living in the city named “Gotham” who have ordered the article called “Batmobile” during the last six months. We have, of course, several ways to formulate this query; the following is probably what an ANSI SQL fan would write:
select distinct c.custname from customers c join orders o on o.custid = c.custid join orderdetail od on od.ordid = o.ordid join articles a on a.artid = od.artid where c.city = 'GOTHAM' and a.artname = 'BATMOBILE' and o.ordered >= somefunc
somefunc is supposed to be a function that returns the date six months prior to the current date. Notice too, the presence ofdistinct, which may be required if one of our customers is an especially heavy consumer of Batmobiles and has recently ordered several of them.
Let’s forget for a while that the optimizer may rewrite the query, and look at the execution plan such a statement suggests. First, we walk thecustomerstable, keeping only rows for which the city happens to be Gotham. Then we search theorderstable, which means that thecustidcolumn there had better be indexed, because otherwise the only hope the SQL engine has of executing the query reasonably fast is to perform some sorting and merging or to scan theorderstable to build a hash table and then operate against that. We are going to apply another filter at this level, against the order date. A clever optimizer will not mind finding the filtering condition in thewhereclause and will understand that in order to minimize the amount of data to join it must filter on the date before performing the join. A not so clever optimizer might be tempted to join first, and then filter, and may therefore be grateful to you for specifying the filtering condition with the join condition, as follows:
join orders o on o.custid = c.custid and a.ordered >= somefunc
Even if the filtering condition really has nothing to do with the join, it is sometimes difficult for the optimizer to understand when that is the case. If the primary key oforderdetail is defined as(ordid, artid)then, becauseordid is the first attribute of the index, we can make use of that index to find the rows associated with an order as in Chapter 3. But if the primary key happens to be(artid, ordid)(and note, either version is exactly the same as far as relational theory is concerned), then tough luck. Some products may be able to make some use of the index* in that case, but it will not provide the efficient access that(ordid, artid)would have allowed. Other products will be totally unable to use the index. The only circumstance that may save us is the existence of a separate index onordid.
Once we have linkedorderdetails toorders, we can proceed toarticles--without any problem this time since we foundartid, the primary key, inorderdetail. Finally, we can check whether the value inarticles is or is not a Batmobile. Is this the end of the story? Not quite. As instructed bydistinct, we must now sort the resulting set of customer names that have passed across all the filtering layers so as to eliminate duplicates.
It turns out that there are several alternative ways of expressing the query that I’ve just described. One example is to use the older join syntax, as follows:
select distinct c.custname from customers c, orders o, orderdetail od, articles a where c.city = 'GOTHAM' and c.custid = o.custid and o.ordid = od.ordid and od.artid = a.artid and a.artname = 'BATMOBILE' and o.ordered >= somefunc
It may just be old habits dying hard, but I prefer this older way, if only for one simple reason: it makes it slightly more obvious that from a logical point of view the order in which we process data is arbitrary, because the same data will be returned irrespective of the order in which we inspect tables. Certainly thecustomers tableis particularly important, since that is the source from which we obtain the data that is ultimately required, while in this very specific context, all the other tables are used purely to support the remaining selection processes. One really has to understand that there is no one recipe that works for all cases. The pattern of table joins will vary for each situation you encounter. The deciding factor is the nature of the data you are dealing with.
A given approach in SQL may solve one problem, but make another situation worse. The way queries are written is a bit like a drug that may heal one patient but kill another.
Please check back next week for the conclusion to this article.
DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.