When I execute DBCC SHOW_STATISTICS, there is a column in the first line of output called “string index”? What kind of an index is that?
Before I answer that question, let me tell you about a special option to DBCC SHOW_STATISTICS. If you’ve ever run this command, you know you get 3 sections of information back. The first section is basic information about the last time the statistics were updated, the number of rows, the number of steps, etc. The second section is density information for each left-based subset of columns. The third section is the histogram for the first column in the statistics. I won't be going into detail on what any of these things mean (i.e. steps, density, histogram), but you can get lots more information from this whitepaper:
What you might not know is that you can get each of the three sections independently by adding an option to the DBCC command:
DBCC SHOW_STATISTICS ('AdventureWorks.Person.Contact',
'IX_Contact_EmailAddress') WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('AdventureWorks.Person.Contact',
'IX_Contact_EmailAddress') WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('AdventureWorks.Person.Contact',
'IX_Contact_EmailAddress') WITH HISTOGRAM;
These options are documented as part of the DBCC SHOW_STATISTICS command for SQL Server 2005. These options were actually available in SQL Server 2000, but they just aren’t documented.
The whitepaper mentions the string index column, and describes its use as follows:
SQL Server 2005 includes patented technology for estimating the selectivity of LIKE conditions. It builds a statistical summary of substring frequency distribution for character columns (a string summary). This includes columns of type text, ntext, char, varchar, and nvarchar. Using the string summary, SQL Server can accurately estimate the selectivity of LIKE conditions where the pattern may have any number of wildcards in any combination.
Basically, what this means is that now there is a way for SQL Server to estimate the cardinality of substrings even if they are not at the beginning of a string. Don’t let the name confuse you. Although a ‘string index’ does maintain an internal structure to keep track of the number of occurrences of various substrings, it is not an index that SQL Server can use to access the desired data rows. For our purposes, it’s probably better to think of this feature as ‘string statistics’.
Having string statistics available allows the optimizer to estimate the number of occurrences for some substrings that occur other than at the beginning of a column of character data (char, varchar, text, etc). If there is an index on the character column, or that includes the character column, SQL Server can choose to scan that index, if the number of rows containing the substring is estimated to be sufficiently small. Accessing the table using an index key from a nonclustered index is called a ‘bookmark lookup’. A bookmark lookup can be either a lookup of a clustered index key, in which case your graphical SHOWPLAN output will indicate ‘key lookup’ or it can be a lookup of a particular Row ID, in which case your graphical SHOWPLAN output will indicate ‘RID lookup’. These are usually considered expensive operations IF SQL Server has to do a lot of them. In addition, scanning the leaf level is expensive, but not as expensive (usually) as scanning the entire table or clustered index. So the combination of scanning the nonclustered leaf plus performing some bookmark lookups is considered a good plan if the number of bookmark lookups is very small.
The optimizer cannot use the substring cardinality information to decide to use an index SEEK. An index SEEK can only be used when there is a specific value in the ordered leaf level that SQL Server can seek TO. Any time you are matching a pattern with a leading %, there is no way to get to a specific value.
Let’s look at an example using the table 'AdventureWorks.Person.Contact' referenced in the DBCC SHOW_STATISTICS command above.
USE AdventureWorks;
SELECT ContactID, FirstName, LastName, EmailAddress
FROM Person.Contact
WHERE EmailAddress LIKE N'%d19%';
SELECT ContactID, FirstName, LastName, EmailAddress
FROM Person.Contact
WHERE EmailAddress LIKE N'%son%';
Look at the SHOWPLAN output for each of these queries. There are 19971 rows in the table, and the first query only returns 16 of them. The second query returns 315. The optimizer uses the string statistics to estimate that the substring ‘d19’ occurs few enough times to use a leaf level scan followed by key (bookmark) lookups. On the other hand, the substring ‘son’ occurs too often to be efficient. A nonclustered leaf scan plus 315 key lookups would be less efficient than just scanning the entire table.
So keep in mind that without these special string statistics, the index on EmailAddress could never be used at all for queries like the above, when we are searching for substring in the middle of a column, no matter how many rows might be returned. The power of the string indexes is to be able to get cardinality estimates for substrings that aren't at the beginning of a string. There was no way to do this before SQL Server 2005.