Joins and Performance
Joins and Performance
Joins, SQL statements in which Derby selects data from two or more tables using one or more key columns from each table, can vary widely in performance. Factors that affect the performance of joins are join order, indexes, and join strategy.
Join Order Overview
In a join operation involving two tables, Derby scans the tables in a particular order. Derby accesses rows in one table first, and this table is now called the outer table.
Then, for each qualifying row in the outer table, Derby looks for matching rows in the second table, which is called the inner table.
Derby accesses the outer table once, and the inner table probably many times (depending on how many rows in the outer table qualify).
- If the join has no restrictions in the WHERE clause that would limit the number of rows returned from one of the tables to just a few, the following rules apply:
- If only one table has an index on the joined column or columns, it is much better for that table to be the inner table. This is because for each of the many inner table lookups, Derby can use an index instead of scanning the entire table.
- Since indexes on inner tables are accessed many times, if the index on one table is smaller than the index on another, the table with the smaller one should probably be the inner table. That is because smaller indexes (or tables) can be cached (kept in Derby's memory, allowing Derby to avoid expensive I/O for each iteration).
- On the other hand, if a query has restrictions in the WHERE clause for one table that would cause it to return only a few rows from that table (for example, WHERE flight_id = 'AA1111'), it is better for the restricted table to be the outer table. Derby will have to go to the inner table only a few times anyway.
SELECT * FROM huge_table, small_table WHERE huge_table.unique_column = 1 AND huge_table.other_column = small_table.non_unique_column
- In this case, the qualification huge_table.unique_column = 1 (assuming a unique index on the column) qualifies only one row, so it is better for huge_table to be the outer table in the join.
The most common join strategy in Derby is called a nested loop. For each qualifying row in the outer table, Derby uses the appropriate access path (index or table) to find the matching rows in the inner table.
Another type of join in Derby is called a hash join. For joins of this type, Derby constructs a hash table representing all the selected columns of the inner table. For each qualifying row in the outer table, Derby does a quick lookup on the hash table to get the inner table data. Derby has to scan the inner table or index only once, to build the hash table.
Nested loop joins are preferable in most situations.
Hash joins are preferable in situations in which the inner table values are unique and there are many qualifying rows in the outer table. Hash joins require that the statement's WHERE clause be an optimizable equijoin:
- It must use the = operator to compare column(s) in the outer table to column(s) in the inner table.
- References to columns in the inner table must be simple column references. Simple column references are described in Directly Optimizable Predicates.
The hash table for a hash join is held in memory and if it gets big enough, it can cause the JVM to run out of memory. The optimizer makes a very rough estimate of the amount of memory required to make the hash table. If it estimates that the amount of memory required would exceed the system-wide limit of memory use for a table, the optimizer chooses a nested loop join instead.
If memory use is not a problem for your environment, set this property to a high number; allowing the optimizer the maximum flexibility in considering a join strategy queries involving large queries leads to better performance. It can also be set to smaller values for more limited environments.
- Derby allows multiple columns as hash keys.
Table of Contents