apache > db
Apache DB Project
Font size:      

Avoiding table scans of large tables

Avoiding table scans of large tables

Derby is fast and efficient, but when tables are huge, scanning tables might take longer than a user would expect. It's even worse if you then ask Derby to sort this data.

Things that you can do to avoid table scans fall into two categories. These categories are, in order of importance:

  1. Index, Index, Index
  2. Prevent the user from issuing expensive queries

Index, Index, Index

Have you ever thought what it would be like to look up a phone number in the phone book of a major metropolitan city if the book were not indexed by name? For example, to look up the phone number for John Jones, you could not go straight to the J page. You would have to read the entire book. That is what a table scan is like. Derby has to read the entire table to retrieve what you are looking for unless you create useful indexes on your table.

Create useful indexes

Indexes are useful when a query contains a WHERE clause. Without a WHERE clause, Derby is supposed to return all the data in the table, and so a table scan is the correct (if not desirable) behavior. (More about that in Prevent the user from issuing expensive queries.)

Derby creates indexes on tables in the following situations:

  • When you define a primary key, unique, or foreign key constraint on a table. See "CONSTRAINT clause" in the Derby Reference Manual for more information.
  • When you explicitly create an index on a table with a CREATE INDEX statement.

For an index to be useful for a particular statement, one of the columns in the statement's WHERE clause must be the first column in the index's key.

For a complete discussion of how indexes work and when they are useful (including pictures), see What Is an Index? and Index use and access paths.

Indexes provide some other benefits as well:

  • If all the data requested are in the index, Derby does not have to go to the table at all. (See Covering Indexes.)
  • For operations that require a sort (ORDER BY), if Derby uses the index to retrieve the data, it does not have to perform a separate sorting step for some of these operations in some situations. (See About the Optimizer's Choice of Sort Avoidance.)
Derby does not support indexing long columns.

Make sure indexes are being used, and rebuild them

If an index is useful for a query, Derby is probably using it. However, you need to make sure. Analyze the way Derby is executing your application's queries. See Analyzing Statement Execution for information on how to do this.

In addition, over time, index pages fragment. Rebuilding indexes improves performance significantly in these situations. To rebuild an index, drop it and then re-create it.

Think about index order

Derby allows you to create index columns in descending order in addition to creating them in ascending order, the default. Descending indexes provide performance benefits for the following kinds of queries:

  • Queries that require sorting data in descending order.
  • Queries that return the minimum or maximum value.

Such indexes can also be used in the same way that ascending indexes are.

To ensure performance benefits, verify that the descending index is being used. Force the index if necessary. See Analyzing Statement Execution for information on how to do this.

Think About Join Order

For some queries, join order can make the difference between a table scan (expensive) and an index scan (cheap). Here's an example:

select fts.flight_id, fa.flight_date, fts.depart_time
from flights fts, flightavailability fa
where fts.flight_id = fa.flight_id and 
fts.segment_number = fa.segment_number
and fts.dest_airport = 'CDG'
and fts.orig_airport = 'DUB'
order by flight_date

If Derby chooses Hotels as the outer table, it can use the index on Hotels to retrieve qualifying rows. (Given the data in toursDB, it will return three rows; three hotels have a city_id of 10.) Then it need only look up data in HotelAvailability three times, once for each qualifying row. And to retrieve the appropriate rows from HotelAvailability, it can use an index for HotelAvailability's hotel_id column instead of scanning the entire table.

If Derby chooses the other order, with HotelAvailability as the outer table, it will have to probe the Hotels table for every row, not just three rows, because there are no qualifications on the HotelAvailability table.

For more information about join order, see Joins and Performance.

Derby usually chooses a good join order. However, as with index use, you should make sure. Analyze the way Derby is executing your application's queries. See Analyzing Statement Execution for information on how to do this.

Decide Whether a Descending Index Would Be Useful

Derby allows you to create an index that uses the descending order for a column. Such indexes improve the performance of queries that order results in descending order or that search for the minimum or maximum value of an indexed column. For example, both of the following queries could benefit from indexes that use descending ordering:

-- would benefit from an index like this:
-- CREATE INDEX c_id_desc ON Citites(city_id DESC) 
 -- would benefit from an index like this:
-- CREATE INDEX f_miles_desc on Flights(miles DESC) 
SELECT MAX(miles) FROM Flight
 -- would benefit from an index like this:
-- CREATE INDEX arrival_time_desc ON Flights(dest_airport, arrive_time DESC)
SELECT * FROM Flights WHERE dest_airport = 'LAX' 

Prevent the user from issuing expensive queries

Some applications have complete control over the queries that they issue; the queries are built into the applications. Other applications allow users to construct queries by filling in fields on a form. Any time you let users construct ad-hoc queries, you risk the possibility that the query a user constructs will be one like the following:

SELECT * FROM ExtremelyHugeTable
ORDER BY unIndexedColumn

This statement has no WHERE clause. It will require a full table scan. To make matters worse, Derby will then have to order the data. Most likely, the user does not want to browse through all 100,000 rows, and does not care whether the rows are all in order.

Do everything you can to avoid table scans and sorting of large results (such as table scans).

Some things you can do to disallow such runaway queries:

  • Use client-side checking to make sure some minimal fields are always filled in. Eliminate or disallow queries that cannot use indexes and are not optimizable. In other words, force an optimizable WHERE clause by making sure that the columns on which an index is built are included in the WHERE clause of the query. Reduce or disallow DISTINCT clauses (which often require sorting) on large tables.
  • For queries with large results, do not let the database do the ordering. Retrieve data in chunks (provide a Next button to allow the user to retrieve the next chunk, if desired), and order the data in the application.
  • Do not use SELECT DISTINCT to populate lists; instead, maintain a separate table of the unique items.

Previous Page
Next Page
Table of Contents