A Close Look at the SQL Query - Five Factors Governing the Art of SQL (Page 4 of 4 )
You have seen in the first part of this chapter exactly how SQL includes both relational and non-relational characteristics. You have also seen how this affects the efficient (and not-so-efficient) workings of the database optimizer. From this point forward, and bearing in mind the lessons of the first part of this chapter, we can concentrate on the key factors that must be considered when using SQL. In my view, there are five main factors:
The total quantity of data from which a result set has to be obtained
The criteria required to define the result set
The size of the result set
The number of tables to be processed in order to obtain the desired result set
The number of other users also modifying this same data
Total Quantity of Data
The volume of data we need to read is probably the most important factor to take into account; an execution plan that is perfectly suitable for a fourteen-row emp table and a four-rowdepttable may be entirely inappropriate for dealing with a 15 million–rowfinancial_flowstable against which we have to join a 5 million–rowproductstable. Note that even a 15 million–row table will not be considered particularly large by the standards of many companies. As a matter of consequence, it is hard to pronounce on the efficiency of a query before having run it against the target volume of data.
Criteria Defining the Result Set
When we write an SQL statement, in most cases it will involve filtering conditions located in where clauses, and we may have severalwhereclauses—a major one as well as minor ones—in subqueries or views (regular views or in-line views). A filtering condition may be efficient or inefficient. However, the significance of efficient or inefficient is strongly affected by other factors, such as physical implementation (as discussed in Chapter 5) and once again, by how much data we have to wade through.
We need to approach the subject of defining the result in several parts, by considering filtering, the central SQL statements, and the impact of large data volumes on our queries. But this is a particularly complex area that needs to be treated in some depth, so I’ll reserve this discussion until later in this chapter, in the major section entitled “Filtering.”
Size of the Result Set
An important and often overlooked factor is how much data a query returns (or how much data a statement changes). This is often dependent on the size of the tables and the details of the filtering, but not in every case. Typically, the combination of several selection criteria which of themselves are of little value in selecting data may result in highly efficient filtering when used in combination with one another. For example, one could cite that retrieving students’ names based on whether they received a science or an arts degree will give a large result set, but if both criteria are used (e.g., students who studied under both disciplines) the consequent result set will collapse to a tiny number.
In the case of queries in particular, the size of the result set matters not so much from a technical standpoint, but mostly because of the end user’s perception. To a very large extent, end users adjust their patience to the number of rows they expect: when they ask for one needle, they pay little attention to the size of the haystack. The extreme case is a query that returns nothing, and a good developer should always try to write queries that return few or no rows as fast as possible. There are few experiences more frustrating than waiting for several minutes before finally seeing a “no data found” message. This is especially annoying if you have mistyped something, realized your error just after hitting Enter, and then have been unable to abort the query. End users are willing to wait to get a lot of data, but not to get an empty result. If we consider that each of our filtering criteria defines a particular result set and the final result set is either the intersection (when conditions areanded) or the union (when conditions areored together) of all the intermediate result sets, a zero result is most likely to result from the intersection of small, intermediate result sets. In other words, the (relatively) most precise criteria are usually the primary reason for a zero result set. Whenever there is the slightest possibility that a query might return no data, the most likely condition that would result in a null return should be checked first—especially if it can be done quickly. Needless to say, the order of evaluation of criteria is extremely context-sensitive as you shall see later under “Filtering.”
A skillful developer should aim for response times proportional to the number of rows returned.
Number of Tables
The number of tables involved in a query will naturally have some influence on performance. This is not because a DBMS engine performs joins badly—on the contrary, modern systems are able to join large numbers of tables very efficiently.
The perception of poor join performance is another enduring myth associated with relational databases. Folklore has it that one should not join too many tables, with five often suggested as the limit. In fact, you can quite easily have 15 table joins perform extremely well. But there are additional problems associated with joining a large number of tables, of which the following are examples:
When you routinely need to join, say, 15 tables, you can legitimately question the correctness of the design; keep in mind what I said in Chapter 1—that a row in a table states some kind of truth and can be compared to a mathematical axiom. By joining tables, we derive other truths. But there is a point at which we must decide whether something is an obvious truth that we can call an axiom, or whether it is a less obvious truth that we must derive. If we spend much of our time deriving our truths, perhaps our axioms are poorly chosen in the first place.
For the optimizer, the complexity increases exponentially as the number of tables increases. Once again, the excellent work usually performed by a statistical optimizer may comprise a significant part of the total response time for a query, particularly when the query is run for the first time. With large numbers of tables, it is quite impractical for the optimizer to explore all possible query paths. Unless a query is written in a way that eases the work of the optimizer, the more complex the query, the greater the chance that the optimizer will bet on the wrong horse.
When we write a complex query involving many tables, and when joins can be written in several fairly distinct ways, the odds are high that we’ll pick the wrong construct. If we join tables A to B to C to D, the optimizer may not have all the information present to know that A can be very efficiently joined directly to D, particularly if that join happens to be a special case. A sloppy developer trying to fix duplicate rows with adistinctcan also easily overlook a missing join condition.
Please check back next week for the continuation of this article.
DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.