apache > db
Apache DB Project
Font size:      

Derby's Cost-Based Optimization

Derby's Cost-Based Optimization

The query optimizer makes cost-based decisions to determine:

About the Optimizer's Choice of Access Path

The optimizer's choice of access path can depend on the number of rows it will have to read. It tries to choose a path that requires the fewest number of rows read. For joins, the number of rows read also depends heavily on the join order (discussed in About the Optimizer's Choice of Join Order.)

How does the optimizer know how many rows a particular access path will read? The answer: sometimes it knows exactly, and sometimes it has to make an educated guess. See "Selectivity and Cardinality Statistics".

About the Optimizer's Choice of Join Order

The optimizer chooses the optimal join order as well as the optimal index for each table. The join order can affect which index is the best choice. The optimizer can choose an index as the access path for a table if it is the inner table, but not if it is the outer table (and there are no further qualifications).

The optimizer chooses the join order of tables only in simple FROM clauses. Most joins using the JOIN keyword are flattened into simple joins, so the optimizer chooses their join order.

The optimizer does not choose the join order for outer joins; it uses the order specified in the statement.

When selecting a join order, the optimizer takes into account:

  • the size of each table
  • the indexes available on each table
  • whether an index on a table is useful in a particular join order
  • the number of rows and pages to be scanned for each table in each join order
Derby does transitive closure on qualifications. For details, see Transitive Closure.

Join Order Case Study

For example, consider the following situation:

The Flights table (as you know) stores information about flight segments. It has a primary key on the flight_id and segment_number columns. This primary key constraint is backed up by a unique index on those columns.

The FlightAvailability table, which stores information about the availability of flight segments on particular days, can store several rows for a particular row in the Flights table (one for each date).

You want to see information about all the flights, and you issue the following query:

FROM FlightAvailability AS fa, Flights AS fts
WHERE fa.flight_id = fts.flight_id
AND fa.segment_number = fts.segment_number

First imagine the situation in which there are no useful indexes on the FlightAvailability table.

Using the join order with FlightAvailability as the outer table and Flights as the inner table is cheaper because it allows the flight_id/segment_number columns from FlightAvailability to be used to probe into and find matching rows in Flights, using the primary key index on Flights.flight_id and Flights.segment_number.

This is preferable to the opposite join order (with Flights as the outer table and FlightAvailability as the inner table) because in that case, for each row in Flights, the system would have to scan the entire FlightAvailability table to find the matching rows (because there is no useful index-- an index on the flight_id/segment_number columns).

Second, imagine the situation in which there is a useful index on the FlightAvailability table (this is actually the case in the sample database). FlightAvailability has a primary key index on flight_id, segment_number, and booking_date. In that index, the flight_id-segment_number combination is not unique, since there is a one-to-many correspondence between the Flights table and the FlightAvailability table. However, the index is still very useful for finding rows with particular flight_id/segment_number values.

You issue the same query:

FROM FlightAvailability AS fa, Flights AS fts
WHERE fa.flight_id = fts.flight_id
AND fa.segment_number = fts.segment_number

Although the difference in cost is smaller, it is still cheaper for the Flights table to be the inner table, because its index is unique, whereas FlightAvailability's index is not. That is because it is cheaper for Derby to step through a unique index than through a non-unique index.

About the Optimizer's Choice of Join Strategy

The optimizer compares the cost of choosing a hash join (if a hash join is possible) to the cost of choosing a nested loop join and chooses the cheaper strategy. For information about when hash joins are possible, see Join Strategies.

In some cases, the size of the hash table that Derby would have to build is prohibitive and can cause the JVM to run out of memory. For this reason, the optimizer has an upper limit on the size of a table on which it will consider a hash join. It will not consider a hash join for a statement if it estimates that the size of the hash table would exceed the system-wide limit of memory use for a table, the optimizer chooses a nested loop join instead. The optimizer's estimates of size of hash tables are approximate only.

About the Optimizer's Choice of Sort Avoidance

Some SQL statements require that data be ordered, including those with ORDER BY, GROUP BY, and DISTINCT. MIN() and MAX() aggregates also require ordering of data.

Derby can sometimes avoid sorting steps for:

Derby can also perform the following optimizations, but they are not based on cost:

Cost-Based ORDER BY Sort Avoidance

Usually, sorting requires an extra step to put the data into the right order. This extra step can be avoided for data that are already in the right order. For example, if a single-table query has an ORDER BY on a single column, and there is an index on that column, sorting can be avoided if Derby uses the index as the access path.

Where possible, Derby's query compiler transforms an SQL statement internally into one that avoids this extra step. For information about internal transformations, see Sort Avoidance. This transformation, if it occurs, happens before optimization. After any such transformations are made, the optimizer can do its part to help avoid a separate sorting step by choosing an already sorted access path. It compares the cost of using that path with the cost of sorting. Derby does this for statements that use an ORDER BY clause in the following situations:

  • The statements involve tables with indexes that are in the correct order.
  • The statements involve scans of unique indexes that are guaranteed to return only one row per scan.

ORDER BY specifies a priority of ordering of columns in a result set. For example, ORDER BY X, Y means that column X has a more significant ordering than column Y.

The situations that allow Derby to avoid a separate ordering step for statements with ORDER BY clauses are:

  • Index scans, which provide the correct order.
    -- covering index 
    SELECT flight_id FROM Flights ORDER BY flight_id
  • The rows from a table when fetched through an index scan.
    -- if Derby  uses the index on orig_airport
    -- to access the data, it can avoid the sort
    -- required by the final ORDER BY 
    SELECT orig_airport, miles
    WHERE orig_airport < 'DDD'
    ORDER BY orig_airport
  • The rows from a join when ordered by the indexed column or columns in the outer table.
    -- if Derby  chooses Cities as the outer table, it
    -- can avoid a separate sorting step 
    SELECT * FROM cities, countries
    WHERE cities.country_ISO_code = countries.country_ISO_code
    AND cities.country_ISO_code < 'DD'
    ORDER BY cities.country_ISO_code
  • Result sets that are guaranteed to return a single row. They are ordered on all of their columns (for example, if there are equality conditions on all the columns in a unique index, all the columns returned for that table can be considered ordered, with any priority of ordering of the columns).
    -- query will only return one row, so that row is
    -- "in order" for ANY column 
    SELECT miles
    FROM Flights
    WHERE flight_id = 'US1381' AND segment_number = 2
    ORDER BY miles
  • Any column in a result set that has an equality comparison with a constant. The column is considered ordered with no priority to its ordering.
    -- The comparison of segment_number
    -- to a constant means that it is always correctly
    -- ordered. Using the index on (flight_id, segment_number)
    -- as the access path means
    -- that the ordering will be correct for the ORDER BY
    -- clause in this query. The same thing would be true if
    -- flight_id were compared to a constant instead. 
    SELECT segment_number, flight_id
    FROM Flights
    WHERE segment_number=2
    ORDER BY segment_number, flight_id

    And because of transitive closure, this means that even more complex statements can avoid sorting. For example:

    -- transitive closure means that Derby  will
    -- add this clause:
    -- AND countries.country_ISO_code = 'CL', which means
    -- that the ordering column is now compared to a constant,
    -- and sorting can be avoided. 
    SELECT * FROM cities, countries
    WHERE cities.country_ISO_code = 'CL'
    AND cities.country_ISO_code = countries.country_ISO_code
    ORDER BY countries.country_ISO_code

    For more information about transitive closure and other statement transformations, see Appendix A, Internal Language Transformations.

About the System's Selection of Lock Granularity

When a system is configured for row-level locking, it decides whether to use table-level locking or row-level locking for each table in each DML statement. The system bases this decision on the number of rows read or written for each table, and on whether a full conglomerate scan is done for each table.

When you have turned off row-level locking for your system, Derby always uses table-level locking.

The first goal of the system's decision is concurrency; wherever possible, the system chooses row-level locking. However, row-level locking uses a lot of resources and might have a negative impact on performance. Sometimes row-level locking does not provide much more concurrency than table-level locking. In those situations, the system might choose to escalate the locking scheme from row-level locking to table-level locking to improve performance. For example, if a connection is configured for TRANSACTION_SERIALIZABLE isolation, the system chooses table-level locking for the following statement:

FROM FlightAvailability AS fa, Flights AS fts
WHERE fts.flight_id = fa.flight_id
AND fts.segment_number = fa.segment_number

To satisfy the isolation requirements, Derby would have to lock all the rows in both the FlightAvailability and the Flights tables. Locking both the tables would be cheaper, would provide the same isolation, and would allow the same concurrency.

You can force lock escalation for specific tables when you alter them with the LOCKSIZE clause. For these tables, Derby always chooses table-level locking. For more information, see the Derby Reference Manual.

How the System Makes Its Decision if it Has a Choice

If the lock granularity (whether to lock rows or entire tables) is not forced by the user, the system makes a decision using the following rules:

  • For SELECT statements running in READ_COMMITTED isolation, the system always chooses row-level locking.
  • If the statement scans the entire table or index and it does not meet the criteria above, the system chooses table-level locking. (A statement scans the entire table whenever it chooses a table as the access path.)
  • If a statement partially scans the index, the system uses row-level locking, until the number of rows touched on a table reaches lock escalation threshold. It is then escalated to a table lock. (You can configure this threshold number; see Lock Escalation Threshold.)
    • For SELECT, UPDATE, and DELETE statements, the number of rows touched is different from the number of rows read. If the same row is read more than once, it is considered to have been touched only once. Each row in the inner table of a join can be read many times, but can be touched at most one time.

Lock Escalation Threshold

The system property derby.locks.escalationThreshold determines the threshold for number of rows touched for a particular table above which the system will escalate to table-level locking. The default value of this property is 5000. For large systems, set this property to a higher value. For smaller systems, lower it.

This property also sets the threshold for transaction-based lock escalation (see Transaction-Based Lock Escalation).

For more information about lock escalation, see Locking and Performance.

About the Optimizer's Selection of Bulk Fetch

When Derby retrieves data from a conglomerate, it can fetch more than one row at a time. Fetching more than one row at a time is called bulk fetch. By default, Derby fetches 16 rows at a time.

Bulk fetch is faster than retrieving one row at a time when a large number of rows qualify for each scan of the table or index. Bulk fetch uses extra memory to hold the pre-fetched rows, so it should be avoided in situations in which memory is scarce.

Bulk fetch is automatically turned off for updatable cursors, for hash joins, for statements in which the scan returns a single row, and for subqueries. It is useful, however, for table scans or index range scans:

FROM Flights
WHERE miles > 4
FROM Flights

The default size for bulk fetch (16 rows) typically provides good performance.

Previous Page
Next Page
Table of Contents