The Microsoft SQL Server database engine uses a cost-based query optimizer to automatically optimize data manipulation queries that are submitted using SQL. A data manipulation query is any query that supports the WHERE or HAVING keywords in SQL, such as SELECT, DELETE, and UPDATE. The cost-based query optimizer produces cost estimates for a clause based on statistical information.
This optimization is accomplished in three phases:
In the query analysis phase, the SQL Server optimizer looks at each clause represented by the canonical query tree and determines whether it can be optimized. SQL Server attempts to optimize clauses that limit a scan, such as search or join clauses. However, not all valid SQL syntax can be broken into optimizable clauses—for example, clauses containing the SQL relational operator <> (not equal to). Because <> is an exclusive rather than an inclusive operator, the selectivity of the clause cannot be determined before scanning the entire underlying table. When a relational query contains nonoptimizable clauses, the execution plan accesses these portions of the query using table scans. If the query tree contains any optimizable SQL syntax, the optimizer performs index selection for each of these clauses.
For each optimizable clause, the optimizer checks the database system tables to see if there is an associated index useful for accessing the data. An index is considered useful only if a prefix of the columns contained in the index exactly matches the columns in the clause of the query. This must be an exact match because an index is built based on the column order presented at creation time. For a clustered index, the underlying data is also sorted based on this index column order. Attempting to use only a secondary column of an index to access data would be similar to attempting to use a phone book to look up all the entries with a particular first name: The ordering would be of little use because you would still have to check every row to find all of the qualifying entries. If a useful index exists for a clause, the optimizer then attempts to determine the clause selectivity.
When cost-based optimization was previously mentioned, it was stated that a cost-based optimizer produces cost estimates for a clause based on statistical information. This statistical information is used to estimate a clause's selectivity (the percentage of tuples in a table that are returned for the clause). Microsoft SQL Server stores this statistical information in a specialized data distribution page associated with each index. This statistical information is updated only:
To provide SQL Server with statistics that accurately reflect the actual tuple distribution of a populated table, the database system administrator must keep current the statistical information for the table indexes. If no statistical information is available for the index, a heuristic based on the relational operator of the clause is used to produce an estimate of selectivity.
Information about the selectivity of the clause and the type of index available is used to calculate a cost estimate for the clause. SQL Server estimates the amount of physical disk I/O that occurs if the index is used to retrieve the result set from the table. If this estimate is lower than the physical I/O cost of scanning the entire table, an access plan that employs the index is created.
When index selection is complete and all clauses have an associated processing cost based on their access plan, the optimizer performs join selection. Join selection is used to find an efficient order for combining the clause access plans. To accomplish this, the optimizer compares various orderings of the clauses and then selects the join plan with the lowest estimated processing costs in terms of physical disk I/O. Because the number of clause combinations can grow combinatorially as the complexity of a query increases, the SQL Server query optimizer uses tree pruning techniques to minimize the overhead associated with these comparisons. When this join selection phase is complete, the SQL Server query optimizer provides a cost-based query execution plan that takes advantage of available indexes (when they are useful) and accesses the underlying data in an order that minimizes system overhead and improves performance.