SQL Indexing and Performance Part 2: Clustered and Non-Clustered
Άρθρο από Lefteris Karafilis Wed, 01/12/2010 - 11:22
So, why do you need to index your tables?
Because without an index SQL server has to scan entire tables to return requested data. It is like the index page in a book. You check for the keyword you want to read about in the index and you jump directly to the page were the content belongs, instead of scanning page by page for the material you want to read.
Similarly a table index allows you to locate data without the need to scan the entire table. You create indexes on one or more columns in a table to help SQL server find the data quickly in a query. Consider the following example:
If you were to retrieve all computers with computer_id > 5100, SQL would have to check the entire table to return the results without the presence of an index. An index on the computer_id column would speed up this process by sorting the column values.
Now, if you wanted to return all data where computer_id > 5100, SQL would know that it have to move down to the first value greater than 5100 since the column is sorted. This eliminates the need to scan the entire table, thus resulting great performance benefits.
There are two main index types; Clustered index and Non-Clustered index.
A clustered index alters the way that the rows are stored. When you create a clustered index on a column (or a number of columns), SQL server sorts the table’s rows by that column(s). It is like a dictionary, where all words are sorted in alphabetical order in the entire book. Since it alters the physical storage of the table, only one clustered index can be created per table. In the above example the entire rows are sorted by computer_id since a clustered index on computer_id column has been created.
CREATE CLUSTERED INDEX [IX_CLUSTERED_COMPUTER_ID]
ON [dbo].[nics] ([computer_id] ASC)
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. It creates a completely different object within the table that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data. It is like an index in the last pages of a book, where keywords are sorted and contain the page number to the material of the book for faster reference. A non-clustered index on the computer_id in the previous example would look like the table below:
CREATE NONCLUSTERED INDEX [IX_NONCLUSTERED_COMPUTER_ID]
ON [dbo].[nics] ([computer_id] ASC)
SQL server sorts the indexes efficiently by using a B-tree, which is a tree data structure that allows SQL Server to keep data sorted, to allow searches, sequential access, insertions and deletions in logarithmic amortized time. This methodology minimizes the number of pages accessed to locate the desired index key, therefore resulting an improved performance.
Relation between clustered and non-clustered indexes
As explained above a non-clustered index contains a pointer back to the rowID (RID) of the table in order to relate the index column with the rest of the columns of a row. But this is not always the case. If a clustered index exist on the table, the non-clustered index instead of RID references is using the clustered index’s key as a row locator. In the above example if a clustered index is created on nic_desc and a non-clustered on computer_id, the non-clustered index would look like the table below.
|5000025||VMware Accelerated AMD PCNet Adapter|
|5000031||Intel(R) PRO/100 VE Network Connection|
|5000045||Broadcom 440x 10/100 Integrated Controller|
|5000046||Broadcom NetXtreme 57xx Gigabit Controller|
|5000055||VMware Accelerated AMD PCNet Adapter|
|5000059||Broadcom 440x 10/100 Integrated Controller|
Index Benefits and Side Effects
A table without a clustered-index is called a “heap table”. A heap table has no sorted data thus SQL server has to scan the entire table in order to locate the data in a process called a “scan”. In the case of a clustered index the data are sorted on the key values (columns) of the index. SQL server is now able to locate the data by navigating down from the root node, to the branch and finally to the leaf nodes of the B-tree structure of the index, in a process called a “seek”. The later approach is much faster when you want to filter or sort the data you want to retrieve.
A non-clustered index from the other side is a completely different object in a table, containing only a subset of columns and a row locator to the table’s rows or to the clustered index’s key. Because of its smaller size (subset of columns), a clustered index can fit more rows in an index page , therefore resulting an improved I/O performance. Furthermore a non-clustered index can be allocated to a different FileGroup which can utilize a different physical storage in order to improve performance even more.
The side effect of indexes is related to the cost of INSERT, UPDATE, MERGE and DELETE statements. Such statements can take longer to execute in the presence of indexes since they alter the data on the table resulting the update of the indexes too. Imagine the situation of an INSERT statement that has to add rows to a table with a clustered index. Table rows may need to be repositioned since clustered index needs to order the data pages themselves thus creating more overhead. So, it is crucial to take into account the overhead of INSERT, UPDATE and DELETE statements before designing your indexing strategy. Although there is an overhead in the above statements, you have to take into account that many times an UPDATE or DELETE statement will have to execute in a subset of data, defined by a WHERE clause, were indexing may outweigh the additional cost of index updates since SQL server has to find the data before updating them.
As explained above a non-clustered index includes the clustered index’s key as its row locator in the case of a clustered index existence on the table; and this comes with a cost and a benefit. The cost has to do with non-clustered index bookmark lookup. What if a query has to return more rows that the ones hosted in the index itself? In the case of a HEAP table, SQL would check the RID of the non-clustered index to navigate directly to the row, were the rest of the columns belong in order to return the results. In the case of a clustered index, SQL would check the row locator of the non-clustered index to do an additional navigation to the B-tree structure of the clustered index to retrieve the desired row, since row locator does not contain the RID but the clustered-index key.
But there is also a benefit of the above relation, between clustered and non-clustered indexes, that has to do with clustered index updates. Imagine the following situation; Two new rows with index key values of A2 and A3 have to be added in the clustered index below.
Since this is a clustered index page, its physical structure has to be reallocated in order to fit A2 and A3 between A1 and A4 to maintain index order. Since there is no free space in the index page to accommodate changes, a page split will occur . Now, there is enough space to fit A2 and A3 between A1 and A4.
The target achieved and the order maintained within the index. But imagine what would happen if the non-clustered index were looking at the RID instead of the clustered index key. It would have to change its row locator table to reflect the changes which could be a huge performance hit in case of large clustered indexes. Since row locator table looks at the clustered index key, it does not have to change its values. This is quite a benefit if you imagine the large clustered indexes usually maintained on many tables.
Clustered indexes VS non-clustered indexes