apache > db
Apache DB Project
Font size:      


Implements BTree access method, which is the basis for SQL indexes.


Derby implements secondary SQL indexes as BTrees. The high level features of the BTree implementation are:

  1. Derby implements the standard B+Tree algorithm, in which all keys are stored in leaf pages, and the upper levels are organized as a redundant index for enabling rapid location of a particular key. See The Ubiquitous B-Tree, Douglas Comer for a definition of the B+Tree.
  2. The BTree implementation supports concurrency using page level latches, and recovery using a Write Ahead Log.
  3. Derby uses data only locking for its logical row level locking.
  4. Derby supports all four SQL isolation levels, serializable, repeatable read, read committed and read uncommitted.
  5. Derby uses logical key deletes. This enables it to perform undos during rollbacks and restart recovery as single page operations.

High level structure of the B+Tree

  • The first row in all pages, branch or leaf, is a control row. In branch pages, this is a BranchControlRow and in leaf pages, it is a LeafControlRow.
  • In addition to the BranchControlRow, branch level pages contain BranchRows.
  • At branch level, a page is linked to is left and right siblings, parent, as well as children. The number of child pages is 1 greater than the number of BranchRows in the page. The leftmost child pointer is stored in the BranchControlRow itself. Each BranchRow contains a child pointer as its last column.
  • At leaf level also, a page is linked to its left and right siblings, and the parent. In addition to the LeafControlRow, leaf level pages contain IndexRows. The last column in an IndexRow is a RowLocation.
  • IndexRows are generated by the IndexRowGenerator.

Latching implementation

Derby uses latches on pages to get exclusive access to the page while reading or writing the page (Derby only uses exclusive latches, no shared latches are used). In order to prevent deadlocks latches requested while holding other latches are always requested top/down and left to right. Btree splits are always left to right. If for any reason the code needs to break this protocol then it will first request the latch NOWAIT and if it can't get the latch it will release all current latches and wait for the latch it is trying to get, and then after obtaining it go about getting the other latches it needs for the particular operation. While traversing down the tree Derby may hold 2 latches: one on parent and one on child. It then continues doing "ladder" latching down the tree releasing the highest node when it has successfully got a new lower node latch. Latches are short term, only held while reading/modifying the page, never held while an I/O is happening. Structure modifications are all isolated from other operations through the use of latches.

Locking and Isolation Levels

Derby uses data only locking for its logical row level locking. All isolation level implementation is done using logical locks (Derby does not support non-locking isolation such as multi-versioning).

In data only locking the key for all locks whether obtained in the BTree or in the base table is the address of the base table row. In the BTree case the address of the base table row (RowLocation) is always the last column of the leaf index entry (IndexRows).

Derby uses "previous key" locking to prevent phantoms from being inserted into a range being accessed by a scan. A scan always locks the index row just "previous" to the first entry that satisfies the scan condition and then all index entries going forward until the end of the scan. All row locks are held until end of transaction. Inserts also get previous key locks, which will block if the existing row is locked.
Repeatable Read
Same as serializable except that no previous key locking is done.
Read Committed
Write locks are held until end of transaction. Read locks are released once the query in question no longer needs them.
Read Uncommitted
No row locks are acquired. The code still gets table level intent locks to prevent concurrent DDL during the query.

BTree Structure Modifications

In Derby, SMOs (structure modification operations - ie. page splits), only happen top down. This is not as concurrent as bottom up in ARIES/IM, but is simpler. As in ARIES/IM Not more than 2 index pages are held latched simultaneously at anytime. In order to improve concurrency and to avoid deadlocks involving latches, even those latches are not held while waiting for a lock wich is not immediately grantable. No data page latch is held or acquired during an index access. Latch coupling is used while traversing the tree - ie. the latch on a parent page is held while requesting a latch on a child page.

In the case of a simple split, exclusive latch is held on P (parent), and C (child). If there is room in P, then new page C1 (child right) is created, new descriminator is inserted into P, and rows moved to C1. Latches are held on P and C for duration of split, and then released.

The hard case is when P does not have room for descriminator key. In this case all latches are released, and Derby does a split pass from top to bottom, and will split the internal nodes that do not have room for the decrimator key. Note this may result in more splits than necessary for this particular insert, but the assumption is that the splits will have to happen eventually anyway. After this split pass is done, the search for the insert starts again from top down, but it must once again check for space because it has given up all its latches and some other transaction may have acquired the space in the meantime.

Optimization is possible to remember C and see if it is right location, and/or use sideway pointers to search right rather than do research of tree.

Logical Key Deletes

In both the BTree and the Heap, deletes are first executed by marking a "deleted" bit. This is to insure space on the page for abort, since row level locking will allow other rows on the page to be modified conncurrently with the transaction executing the delete. The system uses a background daemon to schedule work after commit to reclaim the space of the deleted rows. A row marked deleted can be "purged" if one can obtain a lock on it (if it was an uncommitted delete then the transaction doing the commit would still have an exclusive lock on the row).

Garbage Collection of deleted keys

Since rows are only marked as "deleted", and not physically removed, it is necessary to perform space reclamation on deleted rows.

Online during BTREE split

Whenever there is not enough room on a leaf to do an insert the code attempts to find space on the leaf, by checking if it can reclaim any committed deletes on that leaf. That work only requires the latch on the leaf and NOWAIT row write locks. It is expected that most of the space reclaim done in the BTree goes through this path. Most of this work is done in {@link org.apache.derby.impl.store.access.btree.BTreeController.reclaim_deleted_rows}.

BTREE post commit work

Whenever a delete operation deletes the last row from a leaf page then a BtreePostCommit job is queued to be executed after the transaction which did the delete commits. This work currently requires a table level lock as page merges have not been implemented to be allowed concurrent with other operations. Many DBMSes don't even try to do page merges except when called from some sort of reorg utility. If all rows on page are purged, then the page will move to the free list and perform a merge to the tree.

It is expected that the normal space reclamation happens with row locks during btree split, which is why not much work has been done to optimize btree post commit path

Logging and Recovery

Derby uses physical redo and logical undo for BTree inserts and deletes. Logical undo is simplified as a result of using logical key deletes. If keys were physically removed during deletes, then the undo of a key delete would have required an insert operation which can potentially lead to page splits at various levels within the tree. Since the key is not physically removed, but only marked as "deleted", undoing a key delete is accomplished easily. However, since the page where the insert or delete should take place may have moved, it may be necessary to search for the page.

Structure modification operations (SMOs) are handled as independent internal transactions and commit separately from the transaction that initiated the SMO. Once an SMO has been completed successfully, it is not undone, even if the transaction that caused it decides to abort. During restart recovery undo phase, incomplete internal transactions are undone BEFORE any regular transactions. This ensures that the BTrees are structurally consistent before normal undo begins.