apache > db
Apache DB Project
 
Font size:      

Derby Logging and Recovery

This document describes how Derby implements logging and recovery. This is a work-in-progress derived from Javadoc comments and from explanations Mike Matrigali and others posted to the Derby lists. Please post questions, comments, and corrections to derby-dev@db.apache.org.

Introduction

Derby transaction logging and recovery is based upon the ARIES algorithm.

ARIES - An Overview

Following is a brief description of the main principles behind ARIES.

Firstly, in ARIES, changes always take the system forward. That is to say, even transaction rollbacks are treated as if they are updates to the system. This is counter-inituitive to what the user thinks, because when a user asks for a transaction to be rolled back, they assume that the system is going back to a previous state of affairs. However, from the perspective of ARIES, there is no such thing as going back. For example, if a transaction changes A to B and then rolls back, ARIES treats the rollback as simply an update that changes B to A. The forward change from A to B (redo) and the reversal of B to A (undo) are both recorded as updates to the system. Changes during normal operations are recorded as Redo-Undo log records. As the name implies, these log records can be 'redone' in case of a system crash, or 'undone' in case a rollback is required. Changes made during rollbacks, however, are recorded as Redo-only log records. These log records are called Compensation Log Records (CLRs). The reason these are redo only is that by definition a rollback does not need to be undone, whereas normal updates need to be undone if the transaction decides to rollback.

The second basic principle of ARIES is that during recovery, history is repeated. This can be explained as follows.

When a system crashes, there would be some transactions that have completed (committed or aborted), and others that are still active. The WAL protocol ensures that changes made by completed transactions have been recorded in the Log. Changes made by incomplete transactions may also be present in the Log, because Log Records are created in the same order as the changes are made by the system.

During recovery, ARIES initially replays the Log to the bring the system back to a state close to that when the crash occurred. This means that ARIES replays the effects of not only those transactions that committed or aborted, but also those that were active at the time of the crash. Having brought the system to this state, ARIES then identifies transactions that were incomplete, and rolls them back. The basic idea is to repeat the entire history upto the point of crash, and then undo failed transactions.

This approach has the advantage that during the redo phase, changes can be replayed at a fairly low level, for example, the level of a disk page. ARIES calls this page oriented redo. This feature is significant because it means that until the redo phase is over, the system does not need to know about higher level data structures such as Indexes. Only during the undo phase, when incomplete transactions are being rolled back, does the system need to know about high level data structures.

Features of ARIES

ARIES includes a number of optimisations to reduce the amount of work required during normal operations and recovery.

One optimisation is to avoid application of log records unnecessarily. The LSN of the most recently generated log record is stored in each disk page. This is known as the PageLsn. The PageLsn allows ARIES to determine during the redo phase, whether the changes represented by a log record have been applied to the page or not.

ARIES chains log records for transactions in such a way that those records that are no longer necessary, are skipped during recovery. For example, if a transaction changed A to B, and then rolled back, generating a log record for changing B to A, then during recovery, ARIES would automatically skip the log record that represents the change from A to B. This is made possible by maintaining a UndoLsn pointer in every Log Record. The UndoLsn normally points to the previous log record generated by the transaction. However, in log records generated during Rollback (known as Compensation Log Records), the UndoLsn is made to point to the Log record preceding the one that is being undone. To take an example, let us assume that a transaction generated log record 1, containing change from A to B, then log record 2 containing change from B to C. At this point the transaction decides to rollback the change from B to C. It therefore generates a new log record 3, containing a change from C to B. The UndoLsn of this log record is made to point at log record 1, instead of log record 2. When following the UndoLsn chain, ARIES would skip log record 2.

ARIES also supports efficient checkpoints. During a checkpoint, it is not necessary to flush all database pages to disk. Instead ARIES records a list of dirty buffer pages along with their RecoveryLsn(s). The RecoveryLsn of a page is the LSN of the earliest log record that represents a change to the page since it was read from disk. By using this list, ARIES is able to determine during recovery, where to start replaying the Log.

ARIES supports nested top-level action concept whereby part of a transaction can be commited even if the transaction aborts. This is useful for situations where a structural change should not be undone even if the transaction aborts. Nested top level actions are implemented using Dummy Compensation Log Records - and make use of the ability to skip logs records using the UndoLsn pointer as described previously.

References

  1. For a full description of ARIES, please see Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992, pp94-162. A version of this document is freely available as IBM Research Report RJ6649.

  2. A good description of Write Ahead Logging, and how a log is typically implemented, can be found in Transaction Processing: Concepts and Techniques , by Jim Gray and Andreas Reuter, 1993, Morgan Kaufmann Publishers .

Derby implementation of ARIES

I shall only describe how Derby differs from standard ARIES implementation. Therefore, for a full understanding of the logging and recovery mechanisms in Derby, it is necessary to consult above mentioned papers on ARIES.

Derby uses Log Sequence Numbers to identify Log records. In Derby terminology, LSNs are called LogInstants. LogCounter is an implementation of LogInstant.

Although Derby uses LogInstant, it does not save this with the page data. Instead, a page version number is stored. The page version number is also stored in the log records associated with the page. During recovery (redo), Derby uses the page version to determine whether the page needs redo or not. Here is a comment on the rationale behind this:

Mike Matrigali:
Am going to defer on page version vs. LSN question, but at least mention some guesses, not researched info. You are right bout what exists. I spoke with some colleagues and the best we can come up with is that the implementor wanted to separate the page and the log, in case we ever did a different log format. I will try to do some more research here. I also vaguely remember the implementor mentioning if we ever wanted to implement the LSN on the page, we had space to do so. It may simply have been easier to code the page versions, since in the current system the LSN is the address in the log file (which and it may not be available when the code wants to write it on the page).
As you say in derby all the log records are associated with a page, and thus have a page version number. That page version number in the log is compared with the page version number of the page during redo to determine if the log record needs to be applied. This also has helped us in the past to catch some bugs as we can sanity check during redo that the page is progressing along the expected path, ie. it is a bug during redo to be applying a page version 10 log record to page that is at page version 8. I haven't seen this sanity check in many years, but was useful when the product was first coded.

Derby does not write the dirty pages list within a Checkpoint record. Instead, during checkpoint, Derby flushes all database pages to disk. The redo Low Water Mark (redoLWM) is set to the current LSN when the checkpoint starts. The undo Low Water Mark (undoLWM) is set to the starting LSN of the oldest active transaction. At restart, Derby replays the log from redoLWM or undoLWM whichever is earlier. For a good description of concepts behind the checkpoint method used by Derby, and the use of redo/undo Low Water Marks, please refer to TPCT book (Section 11.3).

Derby uses 'internal' transactions instead of nested top-level actions to separate structural changes from normal operations. Internal transactions have the property that they are always page-oriented and do not require logical undo, ie, undo is always physical. Also, during recovery, incomplete internal transactions are undone before any regular transactions. In ARIES, no special processing is required to handle this, as nested top-level actions are automatically handled as part of normal redo, and are skipped during undo unless they are incomplete, in which case they are undone.

ARIES uses three passes during recovery. The first pass is the analysis pass when ARIES collects information and determines where redo must start. This is followed by the redo pass, and then by the undo pass. Derby omits the analysis pass as this is not required due to the way checkpoints are done.

Derby recovery process

Implemented in org.apache.derby.impl.store.raw.log.LogToFile.recover()

Following is a high level description of Derby recovery process in Derby.

In this implementation, the log is a stream of log records stored in one or more flat files. Recovery is done in 2 passes: redo and undo.

Redo pass
In the redo pass, reconstruct the state of the rawstore by repeating exactly what happened before as recorded in the log.
Undo pass
In the undo pass, all incomplete transactions are rolled back in the order from the most recently started to the oldest.

Recovery Redo pass

Implemented in org.apache.derby.impl.store.raw.log.FileLogger.redo()

The log stream is scanned from the beginning (or from the undo low water mark of a checkpoint) forward until the end. The purpose of the redo pass is to repeat history, i.e, to repeat exactly the same set of changes the rawStore went thru right before it stopped. With each log record that is encountered in the redo pass:

  1. if it isFirst(), then the transaction factory is called upon to create a new transaction object.
  2. if it needsRedo(), its doMe() is called (if it is a compensation operation, then the undoable operation needs to be created first before the doMe is called).
  3. if it isComplete(), then the transaction object is closed.

Recovery Undo pass

Implemented in org.apache.derby.impl.store.raw.xact.XactFactory.rollbackAllTransactions()

Rollback all active transactions that has updated the raw store. Transactions are rolled back in the following order:

  1. Internal transactions in reversed beginXact chronological order
  2. all other transactions in reversed beginXact chronological order

Checkpoints

Implemented in org.apache.derby.impl.store.raw.log.LogToFile.checkpoint()

Only one checkpoint is to be taking place at any given time.

The steps of a checkpoint are:

  1. Switch to a new log file if possible.

    1. Freeze the log (for the transition to a new log file)
    2. Flush current log file
    3. Create and flush the new log file (with file number 1 higher than the previous log file). The new log file becomes the current log file.
    4. Unfreeze the log
  2. Start checkpoint transaction
  3. Gather interesting information about the rawStore:

    1. The current log instant (redoLWM)
    2. The earliest active transaction begin tran log record instant (undoLWM) , all the truncation LWM set by clients of raw store (replication)
  4. Clean the buffer cache
  5. Log the next checkpoint log record, which contains (repPoint, undoLWM, redoLWM) and commit checkpoint transaction.
  6. Synchronously write the control file containing the next checkpoint log record log instant
  7. The new checkpoint becomes the current checkpoint. Somewhere near the beginning of each log file should be a checkpoint log record (not guarenteed to be there)
  8. See if the log can be truncated

The earliest useful log record is determined by the repPoint and the undoLWM, whichever is earlier.

Every log file whose log file number is smaller than the earliest useful log record's log file number can be deleted.

Transactions can be at the following states w/r to a checkpoint - consider the log as a continous stream and not as series of log files for the sake of clarity:

|(BT)-------(ET)| marks the begin and end of a transaction.
.                          checkpoint started
.       |__undoLWM          |
.       V                   |___redoLWM
.                           |___TruncationLWM
.                           |
.                           V
1 |-----------------|
2       |--------------------------------|
3           |-------|
4               |--------------------------------------(end of log)
5                                       |-^-|
.                                   Checkpoint Log Record
---A--->|<-------B--------->|<-------------C-----------

There are only 3 periods of interest :
A) before undoLWM, B) between undo and redo LWM, C) after redoLWM.

Transaction 1 started in A and terminates in B.
During redo, we should only see log records and endXact from this transaction in the first phase (between undoLWM and redoLWM). No beginXact log record for this transaction will be seen.

Transaction 2 started in B (right on the undoLWM) and terminated in C.
Any transaction that terminates in C must have a beginXact at or after undoLWM. In other words, no transaction can span A, B and C. During redo, we will see beginXact, other log records and endXact for this transaction.

Transaction 3 started in B and ended in B.
During redo, we will see beginXact, other log records and endXact for this transaction.

Transaction 4 begins in B and never ends.
During redo, we will see beginXact, other log records. In undo, this loser transaction will be rolled back.

Transaction 5 is the transaction taking the checkpoint.
The checkpoint action started way back in time but the checkpoint log record is only written after the buffer cache has been flushed.

Note that if any time elapse between taking the undoLWM and the redoLWM, then it will create a 4th period of interest.

Derby Logging Overview

A loggable action in Derby is redoable. If the action implements Undoable interface, then it is also undoable. When an undoable action is rolled back, it must generate a Compensation log which represents the action necessary to repeat the undo.

Normally a logged action is rolled back on the same page that it was originally applied to. This is called physical or physiological undo. If the undo needs to be applied to a different page (such as due to a page split in a BTree), then it is called a Logical Undo. In Derby, BTree inserts and deletes require logical undo.

When performing a loggable action, Derby follows this sequence:

  1. Convert the action into a corresponding log operation. Most BTree and Heap operations are translated to Page level actions - ie - the action involves updating one or more pages. For example, a single Heap row insert may be translated to inserts on several pages. Each page insert will be a separate loggable action.
  2. Generate the log data that describes the page level action.
  3. Perform the action after it has been logged. Also, the action is performed using the logged data, in the same way as it would be performed during recovery. In other words, the logged data is used both for normal operations as well as for repeating history. This has the advantage that the recovery execution path is the same as the execution path during normal execution.
  4. If a transaction is being rolled back, first the loggable action is asked to generate the corresponding undo (Compensation) log data. This is then logged, and after that it is performed. As described before, a Compensation action is only redoable, because by definition, an undo action does not need to be undone.

Loggable Interface Hierarchy

  • interface org.apache.derby.iapi.store.raw.Loggable

Container Log Operations Hierarchy

Transaction Management Log Operations Hierarchy

  • class org.apache.derby.impl.store.raw.xact.BeginXact (implements org.apache.derby.iapi.store.raw.Loggable)
  • class org.apache.derby.impl.store.raw.xact.EndXact (implements org.apache.derby.iapi.store.raw.Loggable)
  • class org.apache.derby.impl.store.raw.log.CheckpointOperation (implements org.apache.derby.iapi.store.raw.Loggable)

Page Level Log Operations Hierarchy