Database Recovery Techniques

A computer system, like any other physical device, is subject to failure from a variety of causes: disk crashes, power outages, software errors, or fires in the data center.

A DBMS must guarantee the Atomicity and Durability properties of transactions. If the power cord is pulled out of the server halfway through a massive bank transfer, the DBMS must be able to completely recover the system to a consistent state when the power comes back on.

1. The Write-Ahead Log (WAL)

The central component of database recovery is the Write-Ahead Log (WAL). Some systems call it the redo log or the transaction log.

To maximize performance, databases do not immediately write updated rows to the actual data files on disk. Instead, they modify the data in memory (RAM). However, RAM is volatile: if the power goes out, everything held only in memory is lost.

To guarantee Durability, the DBMS utilizes the WAL protocol:

Before a changed page may be written to the database data files, the log record describing that change must already be on stable storage (the WAL file on disk).
A transaction is not considered "Committed" until its commit record is safely written to the WAL.
Only after the relevant log records are on disk does the DBMS update the actual database data files, usually later and in the background.

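Writing to a sequential log file is fast because it is an "append-only" operation: on a spinning disk the arm does not have to seek between writes, and several log records can be flushed in one I/O. Updating the data files immediately would instead mean many small writes scattered across very large files, which is why databases log first and update the data files later.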
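The three protocol rules above can be sketched in a few lines of Python. This is a toy illustration, not how any particular DBMS implements its log: the TinyWAL class, its method names, and the text record format are all hypothetical, and a plain dictionary stands in for the on-disk data files.

```python
import os
import tempfile


class TinyWAL:
    """Toy key-value store that follows the write-ahead rule (illustrative only)."""

    def __init__(self, log_path):
        self.log = open(log_path, "a", encoding="utf-8")
        self.buffer = {}      # in-memory (volatile) copy of the data
        self.data_file = {}   # stands in for the on-disk data files

    def _append(self, record):
        # Append-only write, then force the record to stable storage.
        # Real systems usually force the log only at commit and before a
        # page flush; forcing on every append keeps the sketch simple.
        self.log.write(record + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())

    def update(self, txn, key, value):
        # Rule 1: the log record is durable before the data can reach disk.
        self._append(f"UPDATE {txn} {key} {value}")
        self.buffer[key] = value

    def commit(self, txn):
        # Rule 2: the transaction counts as committed once this is durable.
        self._append(f"COMMIT {txn}")

    def flush_pages(self):
        # Rule 3: the data files are updated later, after the log is on disk.
        self.data_file.update(self.buffer)

    def close(self):
        self.log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    db = TinyWAL(path)
    db.update("T1", "balance_a", 50)
    db.commit("T1")
    with open(path, encoding="utf-8") as f:
        log_before_flush = f.read().splitlines()
    data_before_flush = dict(db.data_file)   # still empty at this point
    db.flush_pages()
    db.close()
    return log_before_flush, data_before_flush, db.data_file
```

In demo(), the transaction is already committed in the log while the stand-in data file is still empty. That gap is exactly what recovery has to close after a crash.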

2. Checkpointing

Over time, the WAL grows without bound. If a database crashed after running for a year and recovery had to replay the entire year-long log to rebuild the state, restart could take hours or days.

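To solve this, the DBMS periodically performs a Checkpoint. The simplest form, described here, briefly stops the system; production systems usually use "fuzzy" checkpoints that let transactions keep running.

It pauses all new transactions.
It forces any log records still buffered in RAM to the WAL, then forces all modified data in RAM (dirty blocks) to be written to the actual database data files on the disk.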
It appends a <checkpoint> record to the WAL.
It resumes normal operation.

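If the system crashes, the recovery manager only needs to read the WAL backwards until it hits the most recent <checkpoint>. It knows that the changes of any transaction that committed before that checkpoint are safely stored in the data files, so those transactions can be ignored. Transactions that were still active at the checkpoint are the exception: undoing them can require log records written before it.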
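A minimal sketch of the simple checkpoint and the backward scan described above, assuming the log is an in-memory list of strings and dictionaries stand in for RAM and the data files. The function names are hypothetical, and pausing and resuming transactions is left out.

```python
def checkpoint(buffer, data_file, log):
    """Simple checkpoint: flush dirty data, then log the marker."""
    data_file.update(buffer)      # force dirty blocks to the data files
    log.append("<checkpoint>")    # everything logged earlier is now on disk


def records_after_last_checkpoint(log):
    """Return the log records that follow the most recent checkpoint."""
    # Scan backwards so the cost depends on the distance to the last
    # checkpoint, not on the total length of the log.
    for i in range(len(log) - 1, -1, -1):
        if log[i] == "<checkpoint>":
            return log[i + 1:]
    return list(log)  # no checkpoint yet: the whole log must be replayed


# Usage: T1 commits before the checkpoint, T2 after it.
log = ["UPDATE T1 a 1", "COMMIT T1"]
buffer, data_file = {"a": 1}, {}
checkpoint(buffer, data_file, log)
log += ["UPDATE T2 b 2", "COMMIT T2"]
tail = records_after_last_checkpoint(log)
```

Here tail holds only the two T2 records, so recovery never has to look at T1 again.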

3. Recovery via the ARIES Algorithm

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is a recovery algorithm developed at IBM Research. ARIES, or a variant of it, underlies crash recovery in many relational databases, including IBM DB2 and Microsoft SQL Server.

When a crashed database is rebooted, ARIES performs three distinct passes over the Write-Ahead Log:

Analysis Pass: Reads the log forward from the last checkpoint to identify which transactions were active at the moment of the crash and which pages may have held changes that never reached disk. From this it determines the starting point for the Redo pass.
Redo Pass: Reads the log forward from that starting point and repeats history. It reapplies every logged change, including changes made by transactions that never committed, and skips only those that the log sequence number (LSN) stored on a page shows are already on disk. This brings the database back to the physical state it was in at the moment of the crash.
Undo Pass: Now that the database is restored to the crash state, it contains the half-finished work of transactions that never committed. The Undo pass reads the log backwards, finds the active (uncommitted) transactions identified in the Analysis pass, and reverses their changes. Each reversal is itself logged, so a second crash during recovery does not cause work to be undone twice. This guarantees Atomicity.
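The three passes can be illustrated with a heavily simplified sketch. It is not full ARIES: there is no dirty page table, no compensation logging, and the log passed in is assumed to start at the last checkpoint. The LogRecord layout and the function names are hypothetical, and each "page" is a (value, page_lsn) pair in a dictionary.

```python
from collections import namedtuple

# Hypothetical record layout for this sketch: kind is "UPDATE" or "COMMIT".
LogRecord = namedtuple("LogRecord", "lsn txn kind key old new")


def analysis(log):
    """Find transactions that have no COMMIT record (the 'losers')."""
    active = set()
    for rec in log:
        if rec.kind == "UPDATE":
            active.add(rec.txn)
        elif rec.kind == "COMMIT":
            active.discard(rec.txn)
    return active


def redo(log, pages):
    """Repeat history: reapply every update the page has not yet seen."""
    for rec in log:
        if rec.kind != "UPDATE":
            continue
        _, page_lsn = pages.get(rec.key, (None, 0))
        if page_lsn < rec.lsn:            # page is older than the log record
            pages[rec.key] = (rec.new, rec.lsn)


def undo(log, pages, losers):
    """Roll back loser transactions, newest change first."""
    for rec in reversed(log):
        if rec.kind == "UPDATE" and rec.txn in losers:
            # Real ARIES would also write a compensation log record here.
            pages[rec.key] = (rec.old, pages[rec.key][1])


def recover(log, pages):
    losers = analysis(log)
    redo(log, pages)
    undo(log, pages, losers)
    return losers


# T1 moves 100 from A to B and commits; T2 changes A but never commits.
log = [
    LogRecord(1, "T1", "UPDATE", "A", 500, 400),
    LogRecord(2, "T1", "UPDATE", "B", 100, 200),
    LogRecord(3, "T1", "COMMIT", None, None, None),
    LogRecord(4, "T2", "UPDATE", "A", 400, 300),
]
# Disk state at the crash: T1's change to A was flushed, its change to B was not.
pages = {"A": (400, 1), "B": (100, 0)}
losers = recover(log, pages)
```

After recover() runs, A holds 400 and B holds 200: the committed transfer by T1 survives even though part of it had never reached disk, and the uncommitted change by T2 is gone.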
