apache > db
Apache DB Project
 
Font size:      

Derby On Disk Page Format

This document describes the storage format of Derby disk pages. This is a work-in-progress derived from Javadoc comments and from explanations Mike Matrigali posted to the Derby lists. Please post questions, comments, and corrections to derby-dev@db.apache.org.

Introduction

Derby stores table and index data in Containers, which currently map to files in the seg0 directory of the database. In the current Derby implementation there is a 1 to 1 mapping of containers to files. Two containers never map to a single file and 1 container never maps to multiple files.

Data is stored in pages within the container.

A page contains a set of records, which can be accessed by "slot", which defines the order of the records on the page, or by "id" which defines the identity of the records on the page. Clients access records by both slot and id, depending on their needs.

A Table or a BTree index provides a row-based access mechanism (row-based access interface is known as conglomerate). Rows are mapped to records in data pages; in case of a table, a single row can span multiple records in multiple pages.

A container can have three types of pages:

  • Header Page - which is just a specialized version of the Alloc Page.
  • Data Pages which hold data, and
  • Alloc Pages which hold page allocation information. An Alloc page is a specialized verion of the Data page.

The container can be visualised as:

Header Page is currently always page 0 of the container. It contains information that raw store needs to maintain about the container once per container, and is currently implemented as an Alloc Page which "borrows" space from the alloc page for it's information. The original decision was that the designers did not want to waste a whole page for header information, so a part of the page was used and the first allocation map was put on the second half of it. See AllocPage.java for info about layout and borrowing.

Allocation Page - After page 0, all subsequent Allocation pages only have allocation bit maps.

Data Page Format

A data page is broken into five sections.

Format Id

The formatId is a 4 bytes array, it contains the format Id of this page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.

Page Header

The page header is a fixed size, 56 bytes.

Size Type Description
1 byte boolean is page an overflow page
1 byte byte

page status is either VALID_PAGE or INVALID_PAGE(a field maintained in base page)

page goes thru the following transition:
VALID_PAGE <-> deallocated page -> free page <-> VALID_PAGE

deallocated and free page are both INVALID_PAGE as far as BasePage is concerned.
When a page is deallocated, it transitioned from VALID_PAGE to INVALID_PAGE.
When a page is allocated, it trnasitioned from INVALID_PAGE to VALID_PAGE.

8 bytes long pageVersion (a field maintained in base page)
2 bytes unsigned short number of slots in slot offset table
4 bytes integer next record identifier
4 bytes integer generation number of this page (Future Use)
4 bytes integer previous generation of this page (Future Use)
8 bytes bipLocation the location of the beforeimage page (Future Use)
2 bytes unsigned short number of deleted rows on page. (new release 2.0)
2 bytes short spare for future use
4 bytes integer spare for future use (encryption uses to write random bytes here).
8 bytes long spare for future use
8 bytes long spare for future use
Note
Spare space is guaranteed to be writen with "0", so that future use of field should not either not use "0" as a valid data item or pick 0 as a valid default value so that on the fly upgrade can assume that 0 means field was never assigned.

Records

The records section contains zero or more records. Each record starts with a Record Header

Record Header
Type Description
1 byte

Status bits for the record header

RECORD_DELETED used to indicate the record has been deleted
RECORD_OVERFLOW used to indicate the record has been overflowed, it will point to the overflow page and ID
RECORD_HAS_FIRST_FIELD used to indicate that firstField is stored will be stored. When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are set, part of record is on the page, the record header also stores the overflow point to the next part of the record.
RECORD_VALID_MASK A mask of valid bits that can be set currently, such that the following assert can be made:
compressed int record identifier
compressed long overflow page only if RECORD_OVERFLOW is set
compressed int overflow id only if RECORD_OVERFLOW is set
compressed int first field only if RECORD_HAS_FIRST_FIELD is set - otherwise 0
compressed int number of fields in this portion - only if RECORD_OVERFLOW is false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0
Long Rows
A row is long if all of it's columns can't fit on a single page. When storing a long row, the segment of the row which fits on the page is left there, and a pointer column is added at the end of the row. It points to another row in the same container on a different page. That row will contain the next set of columns and a continuation pointer if necessary. The overflow portion will be on an "overflow" page, and that page may have overflow portions of other rows on it (unlike overflow columns).

The Record Header is followed by one or more fields. Each field contains a Field Header and optional Field Data.

Stored Field Header Format
status

The status is 1 byte, it indicates the state of the field. A FieldHeader can be in the following states:

NULL if the field is NULL, no field data length is stored
OVERFLOW indicates the field has been overflowed to another page. overflow page and overflow ID is stored at the end of the user data. field data length must be a number greater or equal to 0, indicating the length of the field that is stored on the current page. The format looks like this: overflowPage will be written as compressed long, overflowId will be written as compressed Int
NONEXISTENT the field no longer exists, e.g. column has been dropped during an alter table
EXTENSIBLE the field is of user defined data type. The field may be tagged.
TAGGED the field is TAGGED if and only if it is EXTENSIBLE.
FIXED the field is FIXED if and only if it is used in the log records for version 1.2 and higher.
fieldDataLength The fieldDataLength is only set if the field is not NULL. It is the length of the field that is stored on the current page. The fieldDataLength is a variable length CompressedInt.
fieldData

Overflow page and overflow id are stored as field data. If the overflow bit in status is set, the field data is the overflow information. When the overflow bit is not set in status, then, fieldData is the actually user data for the field. That means, field header consists only field status, and field data length.
A non-overflow field:

An overflow field:

overflowPage and overflowID
The overflowPage is a variable length CompressedLong, overflowID is a variable Length CompressedInt. They are only stored when the field state is OVERFLOW. And they are not stored in the field header. Instead, they are stored at the end of the field data. The reason we do that is to save a copy if the field has to overflow.

Long Columns
A column is long if it can't fit on a single page. A long column is marked as long in the base row, and it's field contains a pointer to a chain of other rows in the same container with contain the data of the row. Each of the subsequent rows is on a page to itself. Each subsquent row, except for the last piece has 2 columns, the first is the next segment of the row and the second is the pointer to the the following segment. The last segment only has the data segment.

Slot Offset Table

The slot offset table is a table of 6 or 12 bytes per record, depending on the pageSize being less or greater than 64K:

Slot Table Record
Size Content
2 bytes (unsigned short) or 4 bytes (int) page offset for the record that is assigned to the slot
2 bytes (unsigned short) or 4 bytes (int) the length of the record on this page.
2 bytes (unsigned short) or 4 bytes (int) the length of the reserved number of bytes for this record on this page.

First slot is slot 0. The slot table grows backwards. Slots are never left empty.

Checksum

8 bytes of a java.util.zip.CRC32 checksum of the entire's page contents without the 8 bytes representing the checksum.

Allocation Page

An allocation page of the file container extends a normal Stored page, with the exception that a hunk of space may be 'borrowed' by the file container to store the file header.

The borrowed space is not visible to the alloc page even though it is present in the page data array. It is accessed directly by the FileContainer. Any change made to the borrowed space is not managed or seen by the allocation page.

The reason for having this borrowed space is so that the container header does not need to have a page of its own.

Page Format
An allocation page extends a stored page, the on disk format is different from a stored page in that N bytes are 'borrowed' by the container and the page header of an allocation page will be slightly bigger than a normal stored page. This N bytes are stored between the page header and the record space.

The reason why this N bytes can't simply be a row is because it needs to be statically accessible by the container object to avoid a chicken and egg problem of the container object needing to instantiate an alloc page object before it can be objectified, and an alloc page object needing to instantiate a container object before it can be objectified. So this N bytes must be stored outside of the normal record interface yet it must be settable because only the first alloc page has this borrowed space. Other (non-first) alloc page have N == 0.

N is a byte that indicates the size of the borrowed space. Once an alloc page is initialized, the value of N cannot change.

The maximum space that can be borrowed by the container is 256 bytes.

The allocation pages are of the same page size as any other pages in the container. The first allocation page of the FileContainer starts at the first physical byte of the container. Subsequent allocation pages are chained via the nextAllocPageOffset. Each allocation page is expected to manage at least 1000 user pages (for 1K page size) so this chaining may not be a severe performance hit. The logical -> physical mapping of an allocation page is stored in the previous allocation page. The container object will need to maintain this mapping.

The following fields are stored in the page header:

Format of Alloc Page
Type Description
int FormatId (Although 4 bytes are allocated, this uses only the first 2 bytes. Next 2 bytes are unused.)
StoredPageHeader see Stored Page Header
long nextAllocPageNumber - the next allocation page's number
long nextAllocPageOffset - the file offset of the next allocation page
long reserved1 - reserved for future usage
long reserved2 - reserved for future usage
long reserved3 - reserved for future usage
long reserved4 - reserved for future usage
byte N - the size of the borrowed container info
byte[N] containerInfo - the content of the borrowed container info
AllocExtent The one and only extent on this alloc page.

The allocation page contains allocation extent rows. In this first cut implementation, there is only 1 allocation extent row per allocation page.

The allocation extent row is an externalizable object and is directly written on to the page by the alloc page. In other words, it will not be converted in to a storeableRow. This is to cut down overhead, enhance performance and gives more control of the size and layout of the allocation extent row to the alloc page.

Alloc Page detailed implementation notes

Create Container - an embryonic allocation page is formatted on disk by a special static function to avoid instantiating a full AllocPage object. This embryonic allocation has enough information that it can find the file header and not much else. Then the allocation page is properly initialized by creating the first extent.

Open Container - A static AllocPage method will be used to read off the container information directly from disk. Even if the first alloc page (page 0) is already in the page cache, it will not be used because cleaning the alloc page will introduce a deadlock if the container is not in the container cache. Long term, the first alloc page should probably live in the container cache rather than in the page cache.

Get Page - The first alloc page (page 0) will be read into the page cache. Continue to follow the alloc page chain until the alloc page that manages the specified page is found. From the alloc page, the physical offset of the specified page is located.

Cleaning alloc page - the alloc page is written out the same way any page is written out. The container object will provide a call back to the alloc page to write the current version of the container object back into the borrowed space before the alloc page itself is written out.

Cleaning the container object - get the the first alloc page, dirty it and clean it (which will cause it to call the container object to write itself out into the borrowed space). The versioning of the container is independent of the versioning of the alloc page. The container version is stored inside the borrowed space and is opaque to the alloc page.

For the fields in an allocation extent row.

Allocation Extent

An allocation extent row manages the page status of page in the extent. AllocExtent is externalizable and is written to the AllocPage directly, without being converted to a row first.

Format of Allocation Extent
Type Description
long extentOffset - the begin physical byte offset of the first page of this extent
long extentStart - the first logical page mananged by this extent.
long extentEnd - the last page this extent can ever hope to manage.
int extentLength - the number of pages allocated in this extent
int

extentStatus - status bits for the whole extent.
HAS_DEALLOCATED - most likely, this extent has a deallocated page somewhere. If !HAD_DEALLOCATED, the extent has no deallocated page.
HAS_FREE - most likely, this extent has a free page somewhere. If !HAS_FREE, there is no free page in the extent.
ALL_FREE - most likely, this extent only has free pages, good candidate for shrinking the file. If !ALL_FREE, the extent is not all free.
HAS_UNFILLED_PAGES - most likely, this extent has unfilled pages. if !HAS_UNFILLED_PAGES, all pages are filled.
KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages (post v1.3). If not set, this extent has no notion of unfilled page and has no unFilledPage bitmap.
NO_DEALLOC_PAGE_MAP - this extents do not have a dealloc and a free page bit maps. Prior to 2.0, there are 2 bit maps, a deallocate page bit map and a free page bit map. Cloudscape 2.0 and later merged the dealloc page bit map into the free page bit map.
RETIRED - this extent contains only 'retired' pages, never use any page from this extent. The pages don't actually exist, i.e., it maps to nothing (physicalOffset is garbage). The purpose of this extent is to blot out a range of logical page numbers that no longer exists for this container. Use this to reuse a physical page when a logical page has exhausted all recordId or for logical pages that has been shrunk out.

int preAllocLength - the number of pages that have been preallocated
int reserved1
long reserved2 - reserved for future use
long reserved3 - reserved for future use
FreePages(bit) Bitmap of free pages. Bit[i] is ON if page is free for immediate (re)use.
unFilledPages(bit) Bitmap of pages that have free space. Bit[i] is ON if page is likely to be < 1/2 full.

org.apache.derby.iapi.services.io.FormatableBitSet is used to store the bit map. FormatableBitSet is an externalizable class.

A page can have the following logical state:
Free - a page that is free to be used
Valid - a page that is currently in use

There is another type of transitional pages which pages that have been allocated on disk but has not yet been used. These pages are Free.

Bit[K] freePages Bit[i] is ON iff page i maybe free for reuse. User must get the dealloc page lock on the free page to make sure the transaction.

K is the size of the bit array, it must be >= length.