Overview of ICU collation settings

ICU use is becoming more prominent in PostgreSQL. One of the benefits that ICU offers is a lot of customization options for collations. Some of these are given as examples in the PostgreSQL documentation, but I have always found it hard to get complete and easily-accessible information about this.

So for this article, I dug deeper and looked up all the collation settings that there are and tried to work out examples for each one.

I recommend reading my previous article on how collation works. But even if you haven’t done that, this should be interesting as an overview of the possibilities.

That standard document to use as the starting point is Unicode Technical Standard #35: Unicode Locale Data Markup Language – Part 5: Collation. The primary purpose of that document (or document series) is to specify a way to specify locale data. As part of it, it also specifies various settings and other ways to customize collations.

The other place to look is in the source code of CLDR, which is the data that ICU actually uses: https://github.com/unicode-org/cldr/blob/main/common/bcp47/collation.xml

Note that while the Unicode Collation Algorithm (UCA) and other Unicode standards on the one side and ICU and CLDR on the other are closely aligned, there are minor documented differences, for example different defaults for some settings. In practice, you can ignore the UCA defaults and only pay attention to the ICU defaults, but it is worth being aware of this when studying various documents.

There are two groups of collation customizations, and I will split this article into two to keep it manageable:

Parametric settings: These are settings that change the collation order in some algorithmic way independent of the underlying data or language. An example of this is und-u-ka-shifted (see below).
Settings that choose a collation variant defined in some other specification. Examples of this are und-u-co-emoji and de-u-co-phonebk.

In this article, I’m going to look at the parametric settings in more detail. I’ll leave the other settings for a future article.

You can also follow the more technical specification of the parametric settings here: https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings. For each setting, I’m also going to quote the definition from the CLDR XML file, for easy reference.

A note on specifying locale names: This could be a topic on its own, but just to help understand the below: There are two ways to specify locale names in ICU: The modern BCP 47 syntax like en-u-kn-true and the old ICU-specific one like en@colNumeric=yes. In this article, I’m going to use the BCP 47 syntax throughout, but I’m using the older, more verbose keys as section headings, for easier navigation.

The specification for BCP 47 is RFC 5646. The -u- you see in the examples below is the extension identifier registered to Unicode. The reference for that extension in turn is RFC 6067, which points to https://cldr.unicode.org/index/bcp47-extension, which then points to the above XML file for collation-specific information. (There are other locale concerns beyond collation, such as calendars and number formatting, which are covered by these specifications but are not relevant to this article.)

I’m basing all the example locales on und, which stands for “undetermined” and uses the language-neutral (as good as possible) default order defined by UCA.

And a final note before we get started: All examples are with PostgreSQL 15 and ICU 72.

colAlternate

<key name="ka" description="Collation parameter key for alternate handling" alias="colAlternate">
    <type name="noignore" description="Variable collation elements are not reset to ignorable" alias="non-ignorable"/>
    <type name="shifted" description="Variable collation elements are reset to zero at levels one through three"/>
</key>

This setting determines how variable collation elements, such as whitespace and punctuation, are handled. I talked about this in detail in a previous article.

Here is a quick example:

create collation a (provider = icu, locale = 'und-u-ka-noignore');  -- default
create collation b (provider = icu, locale = 'und-u-ka-shifted');

select * from (values ('death'), ('deluge'), ('de luge')) _(x)
    order by x collate a;
    x
---------
 de luge
 death
 deluge

select * from (values ('death'), ('deluge'), ('de luge')) _(x)
    order by x collate b;
    x
---------
 death
 de luge
 deluge

Note that the ICU default is “noignore” but the UCA default is “shifted”.

See the previous article for more details.

colBackwards

<key name="kb" description="Collation parameter key for backward collation weight" alias="colBackwards">
    <type name="true" description="The second level to be backwards" alias="yes"/>
    <type name="false" description="No backwards (the second level to be forwards)" alias="no"/>
</key>

This setting causes the secondary weights (the accents) to be backwards, meaning they are compared starting from the end of the string. This used to be used in French, but it is now obsolescent.

Example:

create collation a (provider = icu, locale = 'und-u-kb-false');  -- normal
create collation b (provider = icu, locale = 'und-u-kb-true');   -- backwards

select * from (values ('cote'), ('coté'), ('côte'), ('côté')) _(x)
    order by x collate a;
  x
------
 cote
 coté
 côte
 côté

select * from (values ('cote'), ('coté'), ('côte'), ('côté')) _(x)
    order by x collate b;
  x
------
 cote
 côte
 coté
 côté

As you can see here, the francophone world is apparently divided on this:

select * from (values ('cote'), ('coté'), ('côte'), ('côté')) _(x)
    order by x collate "fr-FR-x-icu";
  x
------
 cote
 coté
 côte
 côté

select * from (values ('cote'), ('coté'), ('côte'), ('côté')) _(x)
    order by x collate "fr-CA-x-icu";
  x
------
 cote
 côte
 coté
 côté

Again, this is being phased out in practice. But it is an interesting curiosity.

colCaseLevel

<key name="kc" description="Collation parameter key for case level" alias="colCaseLevel">
    <type name="true" description="The case level is inserted in front of tertiary" alias="yes"/>
    <type name="false" description="No special case level handling" alias="no"/>
</key>

This setting allows for creating collations that ignore accents but not case. Using the colStrength (ks) setting, you can ignore weights from the end, for example tertiary and beyond. But to allow ignoring accents but not case you want to ignore the secondary weight but keep the tertiary. That’s what this setting accomplishes.

Example:

create collation a (provider = icu,
                    locale = 'und-u-ks-identic', deterministic = false);
create collation b (provider = icu,
                    locale = 'und-u-kc-true-ks-level1', deterministic = false);

select distinct x collate a
    from (values ('foo'), ('Foo'), ('bar'), ('bär')) _(x);
  x
-----
 foo
 bar
 bär
 Foo

select distinct x collate b
    from (values ('foo'), ('Foo'), ('bar'), ('bär')) _(x);
  x
-----
 Foo
 foo
 bar

The above is probably the most common and obvious use case of this.

The more complete and complicated explanation is that this setting creates an additional weight between the second and third level (accordingly called the “2.5” level) that encodes the case differences. This virtual level is kept even if weights past the primary one would otherwise be ignored, as in the above example.

I have so far always said that the third level encodes the case, but that is not all it does. It also encodes other minor variations of characters. Consider the following:

0061  ; [.2075.0020.0002] # LATIN SMALL LETTER A
24D0  ; [.2075.0020.0006] # CIRCLED LATIN SMALL LETTER A
00E4  ; [.2075.0020.0002][.0000.002B.0002] # LATIN SMALL LETTER A WITH DIAERESIS
0041  ; [.2075.0020.0008] # LATIN CAPITAL LETTER A
24B6  ; [.2075.0020.000C] # CIRCLED LATIN CAPITAL LETTER A
2090  ; [.2075.0020.0015] # LATIN SUBSCRIPT SMALL LETTER A

If we turn the colCaseLevel setting on, a notional “2.5” weight is inserted that would look like this:

0061  ; [.2075.0020.0001.0002] # LATIN SMALL LETTER A
24D0  ; [.2075.0020.0001.0006] # CIRCLED LATIN SMALL LETTER A
00E4  ; [.2075.0020.0001.0002][.0000.002B.0001.0002] # LATIN SMALL LETTER A WITH DIAERESIS
0041  ; [.2075.0020.0003.0008] # LATIN CAPITAL LETTER A
24B6  ; [.2075.0020.0003.000C] # CIRCLED LATIN CAPITAL LETTER A
2090  ; [.2075.0020.0001.0015] # LATIN SUBSCRIPT SMALL LETTER A

Now we can create a collation that keeps the “2.5” weights but ignores the original tertiary weights and beyond. That collation maintains the case distinction but ignores other third-level variations:

create collation c (provider = icu,
                    locale = 'und-u-kc-true-ks-level2', deterministic = false);

select distinct x collate a
    from (values (u&'\0061'),
                 (u&'\24D0'),
                 (u&'\00E4'),
                 (u&'\0041'),
                 (u&'\24B6'),
                 (u&'\2090')) _(x);
 x
---
 ₐ
 a
 Ⓐ
 ⓐ
 A
 ä

select distinct x collate b
    from (values (u&'\0061'),
                 (u&'\24D0'),
                 (u&'\00E4'),
                 (u&'\0041'),
                 (u&'\24B6'),
                 (u&'\2090')) _(x);
 x
---
 a
 A

select distinct x collate c
    from (values (u&'\0061'),
                 (u&'\24D0'),
                 (u&'\00E4'),
                 (u&'\0041'),
                 (u&'\24B6'),
                 (u&'\2090')) _(x);
 x
---
 ä
 a
 A

See https://www.unicode.org/reports/tr35/tr35-collation.html#Case_Parameters for more details.

colCaseFirst

<key name="kf" description="Collation parameter key for ordering by case" alias="colCaseFirst">
    <type name="upper" description="Upper case to be sorted before lower case"/>
    <type name="lower" description="Lower case to be sorted before upper case"/>
    <type name="false" description="No special case ordering" alias="no"/>
</key>

This setting allows you to select whether lower or upper-case letters should sort first. By default, lower case letters sort first.

Example:

create collation a (provider = icu, locale = 'und-u-kf-lower');
create collation b (provider = icu, locale = 'und-u-kf-upper');

select * from (values ('foo'), ('Foo'), ('bar'), ('Bar')) _(x)
    order by x collate "und-x-icu";
  x
-----
 bar
 Bar
 foo
 Foo

select * from (values ('foo'), ('Foo'), ('bar'), ('Bar')) _(x)
    order by x collate a;
  x
-----
 bar
 Bar
 foo
 Foo

select * from (values ('foo'), ('Foo'), ('bar'), ('Bar')) _(x)
    order by x collate b;
  x
-----
 Bar
 bar
 Foo
 foo

This setting works with and without the colCaseLevel (kc) setting. If colCaseLevel is true, then the notional “2.5” weight is modified to modify the case order. For example, continuing the earlier example, for a locale und-u-kc-true-kf-upper, the notional adjusted weights would be:

0061  ; [.2075.0020.0003.0002] # LATIN SMALL LETTER A
24D0  ; [.2075.0020.0003.0006] # CIRCLED LATIN SMALL LETTER A
00E4  ; [.2075.0020.0003.0002][.0000.002B.0003.0002] # LATIN SMALL LETTER A WITH DIAERESIS
0041  ; [.2075.0020.0001.0008] # LATIN CAPITAL LETTER A
24B6  ; [.2075.0020.0001.000C] # CIRCLED LATIN CAPITAL LETTER A
2090  ; [.2075.0020.0003.0015] # LATIN SUBSCRIPT SMALL LETTER A

If colCaseLevel is false, the tertiary weight is modified. For example, for a locale und-u-kf-upper (-kc-false), the notional weights would be:

0061  ; [.2075.0020.30002] # LATIN SMALL LETTER A
24D0  ; [.2075.0020.30006] # CIRCLED LATIN SMALL LETTER A
00E4  ; [.2075.0020.30002][.0000.002B.30002] # LATIN SMALL LETTER A WITH DIAERESIS
0041  ; [.2075.0020.10008] # LATIN CAPITAL LETTER A
24B6  ; [.2075.0020.1000C] # CIRCLED LATIN CAPITAL LETTER A
2090  ; [.2075.0020.30015] # LATIN SUBSCRIPT SMALL LETTER A

But this ends up providing the same order, unless additional options further modify or ignore certain weight levels.

colHiraganaQuaternary

<key name="kh" deprecated="true" description="Collation parameter key for special Hiragana handling" alias="colHiraganaQuaternary">
    <type name="true" description="Hiragana to be sorted before all non-variable on quaternary level" alias="yes"/>
    <type name="false" description="No special handling for Hiragana" alias="no"/>
</key>

Hiragana is part of the Japanese writing system. I am not familiar with that, so I won’t try to conjure up some example here. UTS #35 also says: “Deprecated: Use rules with quarternary relations instead.”

colNormalization

<key name="kk" description="Collation parameter key for normalization" alias="colNormalization">
    <type name="true" description="Convert text into Normalization Form D before calculating collation weights" alias="yes"/>
    <type name="false" description="Skip normalization" alias="no"/>
</key>

Ordinarily, strings are normalized to NFD form before collation. With this setting, this can be disabled. This would give better performance, but might behave incorrectly (or at least differently) in some cases.

The default collation weights are set up so that normalization should not matter in most cases. For example, consider the composed and decomposed forms of “LATIN SMALL LETTER A WITH DIAERESIS”:

00E4  ; [.2075.0020.0002][.0000.002B.0002] # LATIN SMALL LETTER A WITH DIAERESIS

0061  ; [.2075.0020.0002] # LATIN SMALL LETTER A
0308  ; [.0000.002B.0002] # COMBINING DIAERESIS

These would result in exactly the same collation weights for a string whether it is normalized or not.

There are a some cases where normalization does make a difference to the collation. These are described here: https://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting

colNumeric

<key name="kn" description="Collation parameter key for numeric handling" alias="colNumeric">
    <type name="true" description="A sequence of decimal digits is sorted at primary level with its numeric value" alias="yes"/>
    <type name="false" description="No special handling for numeric ordering" alias="no"/>
</key>

This setting causes sequences of digits to sort by their numeric value (numeric ordering, also known as natural sort).

Example:

create collation a (provider = icu, locale = 'und-u-kn-false');  -- standard
create collation b (provider = icu, locale = 'und-u-kn-true');   -- numeric

select * from (values ('A-123'), ('A-12n'), ('A-21'), ('B-100')) _(x)
    order by x collate a;
   x
-------
 A-123
 A-12n
 A-21
 B-100

select * from (values ('A-123'), ('A-12n'), ('A-21'), ('B-100')) _(x)
    order by x collate b;
   x
-------
 A-12n
 A-21
 A-123
 B-100

This can clearly be useful in practice, for sorting almost-numeric things like some product or invoice numbers.

It also works for non-ASCII digits:

select * from (values ('१२३'), ('२१')) _(x) order by x collate a;
  x
-----
 १२३
 २१

select * from (values ('१२३'), ('२१')) _(x) order by x collate b;
  x
-----
 २१
 १२३

colReorder

<key name="kr" description="Collation reorder codes" valueType="multiple" alias="colReorder" since="21">
    <type name="space" description="Whitespace reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="punct" description="Punctuation reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="symbol" description="Symbol reordering code (other than currency), see LDML Part 5: Collation" since="21"/>
    <type name="currency" description="Currency reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="digit" description="Digit (number) reordering code, see LDML Part 5: Collation" since="21"/>
    <type name="REORDER_CODE" description="Other collation reorder code — for script, see LDML Part 5: Collation" since="21"/>
</key>

This setting allows reordering whole blocks of characters relative to each other. For example, by default, digits sort before Latin letters, and Latin letters sort before Cyrillic letters. Someone had to pick some default order, and that’s what it ended up with. With this setting, you can easily specify, sort Cyrillic before Latin letters, and digits after, without having to reorder each character separately. There are many possible values here, including more alphabets and other groups of characters, such as the ones shown in the XML source.

Example:

create collation a (provider = icu, locale = 'und');
create collation b (provider = icu, locale = 'und-u-kr-cyrl-latn-digit');

select * from (values ('123'), ('România'), ('България'), ('Србија')) _(x)
    order by x collate a;
    x
----------
 123
 România
 България
 Србија

select * from (values ('123'), ('România'), ('България'), ('Србија')) _(x)
    order by x collate b;
    x
----------
 България
 Србија
 România
 123

Note that language-specific locales may override the default ordering. For example, the Bulgarian locale puts the Cyrillic script first:

select * from (values ('123'), ('România'), ('България'), ('Србија')) _(x)
    order by x collate "bg-x-icu";
    x
----------
 123
 България
 Србија
 România

See https://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering for more details.

colStrength

<key name="ks" description="Collation parameter key for collation strength" alias="colStrength">
    <type name="level1" description="The primary level" alias="primary"/>
    <type name="level2" description="The secondary level" alias="secondary"/>
    <type name="level3" description="The tertiary level" alias="tertiary"/>
    <type name="level4" description="The quaternary level" alias="quaternary quarternary"/>
    <type name="identic" description="The identical level" alias="identical"/>
</key>

This specifies how many levels of collation weights are taken into account. The other levels are simply ignored. A typical use of this is case-insensitive collation, by ignoring the third and following levels. (Recall that the second level is accents, the third is case.)

Example:

create collation a (provider = icu,
                    locale = 'und-u-ks-identic', deterministic = false);
create collation b (provider = icu,
                    locale = 'und-u-ks-level2', deterministic = false);

select distinct x collate a
    from (values ('foo'), ('Foo'), ('bar'), ('Bar')) _(x);
  x
-----
 foo
 bar
 Bar
 Foo

select distinct x collate b
    from (values ('foo'), ('Foo'), ('bar'), ('Bar')) _(x);
  x
-----
 foo
 bar

You can also ignore both case and accents:

create collation a (provider = icu,
                    locale = 'und-u-ks-identic', deterministic = false);
create collation b (provider = icu,
                    locale = 'und-u-ks-level1', deterministic = false);

select distinct x collate a
    from (values ('foo'), ('Foo'), ('bar'), ('bär')) _(x);
  x
-----
 foo
 bar
 bär
 Foo

select distinct x collate b
    from (values ('foo'), ('Foo'), ('bar'), ('bär')) _(x);
  x
-----
 bar
 foo

Note that you cannot use this by itself to ignore accents but not case. For that, you need to use the colCaseLevel (kc) setting.

maxVariable

<key name="kv" description="Collation parameter key for maxVariable, the last reordering group to be affected by ka-shifted" since="25">
    <type name="space" description="Only spaces are affected by ka-shifted" since="25"/>
    <type name="punct" description="Spaces and punctuation are affected by ka-shifted (CLDR default)" since="25"/>
    <type name="symbol" description="Spaces, punctuation and symbols except for currency symbols are affected by ka-shifted (UCA default)" since="25"/>
    <type name="currency" description="Spaces, punctuation and all symbols are affected by ka-shifted" since="25"/>
</key>

This specifies which characters are to be variable collation elements, which are affected by the colAlternate (ka) setting. In my previous article on this, I always talked about “whitespace and punctuation”, but as you can see here, other groups of characters can also be considered.

Let’s look at an example:

create collation a (provider = icu, locale = 'und-u-ka-shifted');
create collation b (provider = icu, locale = 'und-u-ka-shifted-kv-currency');

select * from (values ('death'), ('deluge'), ('de luge'), ('de$luge')) _(x)
    order by x collate a;
    x
---------
 de$luge
 death
 de luge
 deluge

select * from (values ('death'), ('deluge'), ('de luge'), ('de$luge')) _(x)
    order by x collate b;
    x
---------
 death
 de luge
 de$luge
 deluge

Note how in the first example, the currency character $ sorts like a non-variable character, before the character a, while in the second example, it is affected by the same variable treatment as the space character.

By default, spaces and punctuation are treated as variable in CLDR, whereas UCA also treats symbols as variable by default.

variableTop

<key name="vt" deprecated="true" description="Collation parameter key for variable top" valueType="multiple" alias="variableTop">
    <type name="CODEPOINTS" description="The variable top (one or more Unicode code points: LDML Appendix Q)"/>
</key>

This is a deprecated alternative to maxVariable where instead of specifying variable characters by group names, you select the variable characters by specifying that all characters with a primary weight below a certain value (the “top”) should be variable. This is clearly less convenient than the alternative, so it is deprecated.

Summary

Here is a summary of the discussed collation settings, ignoring the officially deprecated ones:

Alias	Key	Description
colAlternate	ka	alternate handling
colBackwards	kb	backward collation weight
colCaseLevel	kc	case level
colCaseFirst	kf	ordering by case
colNormalization	kk	normalization
colNumeric	kn	numeric handling
colReorder	kr	reorder codes
colStrength	ks	collation strength
maxVariable	kv	last reordering group to be affected by ka-shifted

As mentioned at the beginning, I will look at the (non-parametric) collation/co settings in another article in the future.