Indexing (Primary, Clustering)

If you have a Users table with 1 billion rows, and you execute SELECT * FROM Users WHERE id = 500;, how does the database find that exact row?

Without an index, the DBMS must perform a Full Table Scan. It reads row 1 from the hard disk, checks the ID, reads row 2, checks the ID, and so on. Reading 1 billion rows from a mechanical hard disk takes several minutes.

An Index is an auxiliary data structure (usually a Tree or a Hash Table) used to speed up the retrieval of records in response to certain search conditions. It allows the DBMS to find the row in milliseconds.

1. The Concept of an Index

Think of an index in a database exactly like the index at the back of a textbook.

Instead of reading the entire 800-page book to find the word "Algorithm", you flip to the Index.
The Index says: Algorithm: Page 412.
You instantly turn to page 412.

A database index works the same way. It stores the search key (e.g., the User ID) and a physical pointer to the exact disk block where that entire row is stored.

2. Ordered Indices

In an ordered index, index entries are stored sorted on the search key value.

Primary Index

A primary index is an ordered index whose search key also defines the sequential order of the file.

The data file itself is physically sorted on the disk by the primary key.
Because the data can only be physically sorted in one way, a table can have at most one primary index.
Primary indices are typically created automatically on the PRIMARY KEY column of a table.

Clustering Index

If the data file is sequentially ordered on a non-key attribute (an attribute that does not have distinct values, like Department_ID), the index is called a clustering index.

Like the primary index, it relies on the physical sorting of the disk blocks.

Secondary Index

What if the table is physically sorted by ID, but you frequently search by Email? You can create a Secondary Index on the Email column.

The index entries are sorted by Email, but the actual data blocks on the disk remain sorted by ID.
You can create as many secondary indices as you want on a table.

3. Dense vs. Sparse Indices

Dense Index: An index record appears for every search-key value in the file. Finding a record is extremely fast, but the index itself takes up a large amount of memory.
Sparse Index: An index record appears for only some of the values in the file. (e.g., an index entry for the first ID on every disk block). To locate a record, the DBMS finds the largest index entry less than or equal to the search key, jumps to that block, and sequentially scans the single block. Sparse indices take up much less memory.

The Hidden Cost of Indexing: Why not just create an index on every single column? Because every time you INSERT, UPDATE, or DELETE a row, the DBMS must update the table AND every single index attached to it. Adding too many indices makes read queries blazing fast, but makes write operations dreadfully slow.