Apache DB Project
 

Derby Write Ahead Log Format

This document describes the storage format of the Derby Write Ahead Log. It is a work in progress derived from Javadoc comments and from explanations that Mike Matrigali and others posted to the Derby lists. Please post questions, comments, and corrections to derby-dev@db.apache.org.

Introduction

Derby uses a Write Ahead Log to record all changes to the database. The Write Ahead Log (WAL) protocol requires the following rules to be followed:

  1. A page must be latched exclusively before it can be updated.
  2. While the latch is held, the update must be logged, and the page must be tagged with the identity of the log record (often known as the Log Sequence Number or LSN).
  3. When the page is about to be written to persistent storage, all log records up to and including the page's LSN must be forced to disk.
  4. Once the log records have been forced to disk, the cached page may be written to persistent storage, overwriting the previous version of the page.
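
The four rules above can be sketched as a small in-memory model. This is only an illustration of the protocol, not Derby code: the class names WalSketch, Log, and Page are hypothetical, the LSN is simply the position of a record in a list, and "forcing to disk" is modelled by advancing a counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-memory model of the four WAL rules; these are not Derby classes.
class WalSketch {
    static class Log {
        private final List<String> records = new ArrayList<>();
        private long flushedLsn = 0; // highest LSN considered to be on disk

        // Appends a record and returns its LSN (its 1-based position in this model).
        synchronized long append(String record) {
            records.add(record);
            return records.size();
        }

        // Models forcing all records up to and including lsn to disk.
        synchronized void flushUpTo(long lsn) {
            if (lsn > flushedLsn) {
                flushedLsn = lsn;
            }
        }

        synchronized long flushedLsn() {
            return flushedLsn;
        }
    }

    static class Page {
        final ReentrantLock latch = new ReentrantLock(); // exclusive latch
        String data = "";
        long pageLsn = 0; // LSN of the last logged update to this page
    }

    // Rules 1 and 2: latch the page, log the update, tag the page with the LSN.
    static void update(Page page, Log log, String newData) {
        page.latch.lock();
        try {
            long lsn = log.append("update:" + newData);
            page.data = newData;
            page.pageLsn = lsn;
        } finally {
            page.latch.unlock();
        }
    }

    // Rules 3 and 4: force the log up to the page's LSN, then write the page.
    static String writePage(Page page, Log log) {
        page.latch.lock();
        try {
            log.flushUpTo(page.pageLsn);
            return page.data; // stands in for the write to persistent storage
        } finally {
            page.latch.unlock();
        }
    }
}
```

In this model a page can be updated many times without any log flush; the flush is only required at the moment the page itself is about to be written.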

The WAL protocol ensures that in the event of a system crash, database pages can be restored to a consistent state using the information contained in the log records. How this is done will be the subject of another paper.

References

A good description of Write Ahead Logging, and how a log is typically implemented, can be found in Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter, 1993, Morgan Kaufmann Publishers.

Derby implementation of the Write Ahead Log

Derby implements the Write Ahead Log using a non-circular file system file. Here are some comments about the current implementation of recovery:

Suresh Thalamati
Derby supports simple media recovery. It has support for full backup/restore and a very basic form of rollforward recovery (replay of logs using backup and archived log files).

Mike Matrigali
1. Derby fully supports crash recovery; it uses Java to correctly sync the log file to support this.
2. I would say Derby supports media recovery. One can make a backup of the data and store it offline. Logs can be stored on a separate disk from the data, and if you lose your data disk then you can use rollforward recovery on the existing logs and the copy of the backup to bring your database up to the current point in time.
3. Derby does not support "point in time recovery". Someone may want to look at this in the future. Technically I don't think it would be very hard, as the logging system has the stuff to solve the hard problems. It does not have an idea about "time"; it just knows log sequence numbers, so we need to figure out what kind of interface a user really wants. A very user-unfriendly interface, recovery to a specific log sequence number, would not be very hard to implement. Anyone interested in this feature should add it to JIRA; I'll be happy to add technical comments on what needs to be done.
4. A reasonable next step in Derby recovery progress would be to add a way to automatically move or copy log files offline once they are no longer needed by crash recovery and are only needed for media recovery. Some sort of Java stored procedure callout would seem most appropriate.

The 'log' is a stream of log records. The 'log' is implemented as a series of numbered log files. These numbered log files are logically continuous so a transaction can have log records that span multiple log files. A single log record cannot span more than one log file. The log file number is monotonically increasing.

The log belongs to a log factory of a RawStore. In the current implementation, each RawStore only has one log factory, so each RawStore only has one log (which is composed of multiple log files). At any given time, a log factory only writes new log records to one log file; this log file is called the 'current log file'.

A log file is named log<logNumber>.dat, where logNumber is the log file number (for example, log1.dat).

With the default values, a new log file is created (this is known as a log switch) when the current log file grows beyond 1MB, and a checkpoint happens when 10MB or more of log has been written since the last checkpoint.
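
The interaction of the two default thresholds can be illustrated with a small counter model. This is a sketch only: the class LogSwitchSketch and its fields are hypothetical, and Derby's real logic for deciding when to switch or checkpoint may differ in detail (for example, in exactly when the check is made).

```java
// Hypothetical tracker; the two constants mirror the default values stated above.
class LogSwitchSketch {
    static final long LOG_SWITCH_BYTES = 1L * 1024 * 1024;  // 1MB log switch threshold
    static final long CHECKPOINT_BYTES = 10L * 1024 * 1024; // 10MB checkpoint threshold

    long currentLogNumber = 1;
    long bytesInCurrentFile = 0;
    long bytesSinceCheckpoint = 0;
    int checkpoints = 0;

    // Name of the current log file: "log", then the log number, then ".dat".
    String currentFileName() {
        return "log" + currentLogNumber + ".dat";
    }

    // Accounts for one log record of the given size being written.
    void recordWritten(long bytes) {
        bytesInCurrentFile += bytes;
        bytesSinceCheckpoint += bytes;
        if (bytesInCurrentFile > LOG_SWITCH_BYTES) {
            // Log switch: later records go to the next numbered file.
            currentLogNumber++;
            bytesInCurrentFile = 0;
        }
        if (bytesSinceCheckpoint >= CHECKPOINT_BYTES) {
            // Enough log has accumulated since the last checkpoint.
            checkpoints++;
            bytesSinceCheckpoint = 0;
        }
    }
}
```

With these defaults, several log switches normally happen between two consecutive checkpoints.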

RawStore exposes a checkpoint method which clients can call, or a checkpoint is taken automatically by the RawStore when:

  1. The amount of log written since the last checkpoint exceeds a certain size (configurable, default 10MB)
  2. RawStore is shut down and a checkpoint hasn't been done "for a while"
  3. RawStore is recovered and a checkpoint hasn't been done "for a while"

LogCounter

Log records are identified using LogCounter, which is an implementation of LogInstant, a Derby term for LSN. The LogCounter is made up of the log file number, and the byte offset of the log record within the log file. Within the stored log record a log counter is represented as a long. Outside the LogFactory the instant is passed around as a LogCounter (through its LogInstant interface).

The long is encoded so that the ordinary <, == and > comparisons on two encoded values correctly tell whether one log instant is less than, equal to, or greater than another.
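
One way to obtain this ordering property is to place the log file number in the high-order bits of the long and the byte offset in the low-order bits. The sketch below assumes a 32-bit split purely for illustration; the class LogInstantSketch is hypothetical, and the bit widths used by Derby's LogCounter may be different.

```java
// Sketch of packing (file number, offset) into one long. The 32-bit split is an
// assumption made for illustration; Derby's LogCounter may use other widths.
class LogInstantSketch {
    static final int FILE_NUMBER_SHIFT = 32;
    static final long OFFSET_MASK = 0xFFFFFFFFL;

    static long encode(long fileNumber, long offset) {
        // Keeping the file number within 31 bits keeps the encoded long
        // non-negative, so signed comparison matches the logical order.
        if (fileNumber < 0 || fileNumber > Integer.MAX_VALUE
                || offset < 0 || offset > OFFSET_MASK) {
            throw new IllegalArgumentException("value out of range");
        }
        return (fileNumber << FILE_NUMBER_SHIFT) | offset;
    }

    static long fileNumber(long instant) {
        return instant >>> FILE_NUMBER_SHIFT;
    }

    static long offset(long instant) {
        return instant & OFFSET_MASK;
    }
}
```

Because the file number occupies the more significant bits, any record in a later log file compares greater than every record in an earlier file, and within one file the order follows the byte offset.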

Format of Write Ahead Log

An implementation of a file-based log is in org.apache.derby.impl.store.raw.log.LogToFile. This LogFactory is responsible for the formats of two kinds of file: the log file and the log control file. It is also responsible for the format of the log record wrapper.

Format of Log Control File

The log control file contains information about which log files are present and where the last checkpoint log record is located.

Type   Description
int    format id, set to FILE_STREAM_LOG_FILE
int    obsolete log file version
long   the log instant (LogCounter) of the last completed checkpoint
int    Derby major version
int    Derby minor version
int    subversion revision/build number
byte   flags (beta flag (0 or 1), test durability flag (0 or 1))
byte   spare (value set to 0)
byte   spare (value set to 0)
byte   spare (value set to 0)
long   spare (value set to 0)
long   checksum for control data written
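
The field sequence above can be written and read back with java.io.DataOutputStream and java.io.DataInputStream. The sketch below assumes that the fields are stored back to back in big-endian order with no other framing, which the table does not state. The class ControlFileSketch is hypothetical, and the checksum is carried as an opaque value because its algorithm is not described here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical holder for the control file fields, in the order of the table above.
class ControlFileSketch {
    final int formatId;
    final long checkpointInstant;
    final int major;
    final int minor;
    final int build;
    final byte flags;
    final long checksum; // opaque here; not computed or verified

    ControlFileSketch(int formatId, long checkpointInstant, int major, int minor,
                      int build, byte flags, long checksum) {
        this.formatId = formatId;
        this.checkpointInstant = checkpointInstant;
        this.major = major;
        this.minor = minor;
        this.build = build;
        this.flags = flags;
        this.checksum = checksum;
    }

    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(formatId);
            out.writeInt(0);                 // obsolete log file version
            out.writeLong(checkpointInstant);
            out.writeInt(major);
            out.writeInt(minor);
            out.writeInt(build);
            out.writeByte(flags);
            out.writeByte(0);                // three spare bytes
            out.writeByte(0);
            out.writeByte(0);
            out.writeLong(0L);               // spare long
            out.writeLong(checksum);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static ControlFileSketch fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int formatId = in.readInt();
            in.readInt();                    // obsolete log file version, ignored
            long checkpointInstant = in.readLong();
            int major = in.readInt();
            int minor = in.readInt();
            int build = in.readInt();
            byte flags = in.readByte();
            in.readFully(new byte[3 + 8]);   // skip spare bytes and spare long
            long checksum = in.readLong();
            return new ControlFileSketch(formatId, checkpointInstant, major, minor,
                                         build, flags, checksum);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under these assumptions the encoded fields occupy 48 bytes.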

Format of the log file

The log file contains log records which record all the changes to the database. The complete transaction log is composed of a series of log files.

Type                   Description
int                    Format id of this log file, set to FILE_STREAM_LOG_FILE.
int                    Obsolete log file version - not used
long                   Log file number - this number orders the log files in a series to form the complete transaction log
long                   PrevLogRecord - log instant of the previous log record, in the previous log file.
[log record wrapper]*  one or more log records with wrapper
int                    EndMarker - value of zero. The beginning of a log record wrapper is the length of the log record, therefore it is never zero
[int fuzzy end]*       zero or more ints of value 0, present when this log file has been recovered and an incomplete log record was set to zero.
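
The fixed-size part of this layout can be sketched with java.nio.ByteBuffer. The example assumes the four header fields are stored back to back in big-endian order (ByteBuffer's default), which gives a 24-byte header. The class LogFileHeaderSketch and its method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical reader/writer for the fixed header at the start of a log file.
class LogFileHeaderSketch {
    // int format id + int obsolete version + long file number + long PrevLogRecord
    static final int HEADER_BYTES = 4 + 4 + 8 + 8;

    static ByteBuffer writeHeader(int formatId, long logFileNumber, long prevLogRecord) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES);
        buf.putInt(formatId);
        buf.putInt(0);              // obsolete log file version, not used
        buf.putLong(logFileNumber);
        buf.putLong(prevLogRecord);
        buf.flip();
        return buf;
    }

    // Absolute reads, so the buffer's position is left unchanged.
    static long readLogFileNumber(ByteBuffer file) {
        return file.getInt(0) == 0 ? -1 : file.getLong(8);
    }

    static long readPrevLogRecord(ByteBuffer file) {
        return file.getLong(16);
    }

    // A length field of zero where a wrapper would start is the EndMarker.
    static boolean isEndMarker(int lengthField) {
        return lengthField == 0;
    }
}
```

In this sketch, readLogFileNumber returns -1 when the format id is zero, as a simple guard against reading an uninitialised file; that guard is an illustration and is not taken from Derby.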

Format of the log record wrapper

The log record wrapper provides information for the log scan.

Type          Description
int           length - length of the log record (for forward scan)
long          instant - LogInstant of the log record
byte[length]  logRecord - byte array that is written by the FileLogger
int           length - length of the log record (for backward scan)
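
Storing the length on both sides of the record is what allows a scan in either direction. The sketch below illustrates this with java.nio.ByteBuffer, assuming big-endian fields laid out exactly as in the table. The class WrapperScanSketch is hypothetical and is not Derby's scan implementation.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical forward and backward scans over wrapped log records.
class WrapperScanSketch {
    static final int OVERHEAD = 4 + 8 + 4; // leading length + instant + trailing length

    // Appends one wrapped record at the buffer's current position.
    static void append(ByteBuffer log, long instant, byte[] record) {
        log.putInt(record.length);
        log.putLong(instant);
        log.put(record);
        log.putInt(record.length);
    }

    // Forward scan: the leading length tells where the next wrapper begins.
    static List<Long> scanForward(ByteBuffer log, int start) {
        List<Long> instants = new ArrayList<>();
        int pos = start;
        while (pos + 4 <= log.limit()) {
            int length = log.getInt(pos);
            if (length == 0) {
                break; // EndMarker reached
            }
            instants.add(log.getLong(pos + 4));
            pos += OVERHEAD + length;
        }
        return instants;
    }

    // Backward scan: the trailing length tells where the previous wrapper begins.
    // 'end' is the position just past the last complete wrapper.
    static List<Long> scanBackward(ByteBuffer log, int start, int end) {
        List<Long> instants = new ArrayList<>();
        int pos = end;
        while (pos > start) {
            int length = log.getInt(pos - 4);
            pos -= OVERHEAD + length;
            instants.add(log.getLong(pos + 4));
        }
        return instants;
    }
}
```

A forward scan stops at the zero-valued EndMarker, while a backward scan stops when it reaches the position of the first wrapper.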

The format of a log record

The log record describes every change to the persistent store.

Type           Description
int            format_id, set to LOG_RECORD. The formatId is written by FormatIdOutputStream when this object is written out by writeObject
CompressedInt  loggable group - the loggable's group value (described below)
TransactionId  xactId - the transaction this log record belongs to
Loggable       op - the log operation

Each loggable belongs to one or more groups of similar functionality.

Grouping is a way to quickly sort out log records that are interesting to different modules or different implementations.

When a module creates a loggable and sends it to the log file, it must mark this loggable with one or more of the following groups. If none fit, or if the loggable encompasses functionality that is not described in existing groups, then a new group should be introduced.

Grouping has no effect on how the record is logged or how it is treated in rollback or recovery.

The following groups are defined. This list serves as the registry of all loggable groups.

Loggable Groups
Name           Value  Description
FIRST          0x1    The first operation of a transaction.
LAST           0x2    The last operation of a transaction.
COMPENSATION   0x4    A compensation log record.
BI_LOG         0x8    A BeforeImage log record.
COMMIT         0x10   The transaction committed.
ABORT          0x20   The transaction aborted.
PREPARE        0x40   The transaction prepared.
XA_NEEDLOCK    0x80   Need to reclaim locks associated with this log record during XA prepared xact recovery.
RAWSTORE       0x100  A log record generated by the raw store.
FILE_RESOURCE  0x400  Related to "non-transactional" files.
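
Each group value listed above is a distinct bit, so a loggable that belongs to several groups can carry them combined in a single integer. The sketch below shows that idea. The class LoggableGroupsSketch and the inGroup helper are hypothetical; only the constant names and values come from the table.

```java
// Group values copied from the Loggable Groups table; the helper is illustrative.
class LoggableGroupsSketch {
    static final int FIRST = 0x1;
    static final int LAST = 0x2;
    static final int COMPENSATION = 0x4;
    static final int BI_LOG = 0x8;
    static final int COMMIT = 0x10;
    static final int ABORT = 0x20;
    static final int PREPARE = 0x40;
    static final int XA_NEEDLOCK = 0x80;
    static final int RAWSTORE = 0x100;
    static final int FILE_RESOURCE = 0x400;

    // True if the combined group value includes the given group bit.
    static boolean inGroup(int groupValue, int group) {
        return (groupValue & group) != 0;
    }
}
```

For example, a value of LAST | COMMIT | RAWSTORE marks a raw store record that is both the last operation of its transaction and a commit, and a module interested only in commits can test for the COMMIT bit without decoding the rest of the record.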

Pointers to relevant classes

Fixme (DM)
This section should link to appropriate Javadoc documentation
Package org.apache.derby.iapi.store.raw.log

Class                     Description
LogFactory.java           The Java interface for the logging system module.

Package org.apache.derby.impl.store.raw.log

Class                     Description
LogToFile.java            The implementation of LogFactory.java, also implementing Module; this is the one with the recovery code.
CheckpointOperation.java  A Log Operation that represents a checkpoint.
FileLogger.java           Deals with putting log records to disk. Writes log records to a log file as a stream (i.e. log records are added to the end of the file, with no concept of pages).
FlushedScan.java          Deals with scanning the log file. Scans the log, which is implemented by a series of log files. This log scan knows how to move across log files if it is positioned at the boundary of a log file and needs to getNextRecord.
FlushedScanHandle.java    More stuff dealing with scanning the log file.
Scan.java                 More scan log file stuff. Scans the log, which is implemented by a series of log files. This log scan knows how to move across log files if it is positioned at the boundary of a log file and needs to getNextRecord.
StreamLogScan.java        More scan log file stuff. LogScan provides methods to read a log record and get its LogInstant in an already defined scan.
LogAccessFile.java        Lowest level of putting log records to disk. Wraps a RandomAccessFile to provide buffering on log writes.
LogAccessFileBuffer.java  Utility for LogAccessFile. A single buffer of data.
LogCounter.java           Log sequence number (LSN) implementation.
LogRecord.java            The log record written out to disk.
ReadOnly.java             An alternate read-only implementation of LogFactory.