A character set is a set of symbols and encodings. Character data types are represented as Unicode 2.0 sequences in Derby.
Derby supports a wide range of character sets and encodes all of the character sets by using the Unicode support provided by the java.lang.Character class in the Java Virtual Machine (JVM) in which the Derby database runs. See the Java API documentation for the java.lang.Character class for the exact level of Unicode Standard that is supported.
A collation is a set of rules for comparing characters in a character set. In Derby the collation rules affect comparisons of the CHAR and VARCHAR data types. Collation rules also affect how the LIKE Boolean operator processes the CHAR, VARCHAR, CLOB, and LONG VARCHAR data types.
The default Derby collation rule is based on the binary Unicode values of the characters. So a character is greater than (<), equal to (=), or less than (>) another character based on the numeric comparison of the Unicode values. This rule allows for very efficient comparisons of character strings.
Derby also supports the ability to define collation rules that are appropriate to a territory, and is referred to as territory-based collation. Derby supports the territories that Java supports.
You can specifically set the territory of a database when you create the database. If you do not specify a territory, Derby uses the default territory of the JVM in which the database is created. Each JVM can support many territories that are independent from the default territory for the JVM. Collation support for these additional territories is provided through the java.text.RuleBasedCollator class and the set of rules for these territories. Refer to the JVM specification for details of how these rules are used to provide territory specific collation. Derby currently supports only running those rules that can be loaded dynamically from the running JVM based on the territory attribute. Overrides to these rules by the user are not supported.
The territory-based collation in Derby affects how the CHAR and VARCHAR data types are compared. Specifying territory-based collation also impacts how the LIKE Boolean operator processes CHAR, VARCHAR, CLOB, and LONG VARCHAR data.
Territory-based collation does add extra processing overhead to all character-based comparison operations.
When you create a Derby database, the attributes that you set determine the collation that is used with all of character data in the database. The following table shows some examples.
Example Create URLs | Collation Is Driven By |
---|---|
jdbc:derby:abcDB;create=true | Unicode codepoint collation (UCS_BASIC), which is the default collation for Derby databases. |
jdbc:derby:abcDB;create=true;territory=es_MX | Unicode codepoint collation (UCS_BASIC). The collation attribute is not set. |
jdbc:derby:abcDB;create=true;collation=TERRITORY_BASED | The territory of the JVM, since the territory attribute
is not set. Tip: To determine the territory of the JVM, run Locale.getDefault().
|
jdbc:derby:abcDB;create=true;territory=es_MX;collation=TERRITORY_BASED | The territory attribute. |
????????????UCS_BASIC collation???????????? Territory-based collation
????????????????????????????????Grape????????????????????????????????????????????????????????????????Grape
????????????????????????????????Pineapple???????????????????????????????????????????????????? orange
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????Pineapple
For information on creating case-insensitive databases, see Creating a database with territory-based collation.
For WHERE clause 1, Derby returns TRUE because the collation elements for the entire string 'zcb' will match the collation elements of the entire string 'xycb'.
For WHERE clause 2, Derby returns FALSE because collation element for character 'z' does not match the collation element for character 'x'. In addition, when metacharacter such as an underscore is used with the LIKE operator, the metacharacter counts for one character in the string value. A clause like WHERE 'xycb' LIKE '_cb' returns FALSE because 'x' is compared to the metacharacter _ and 'y' does not match 'c'.