Overview of ICU collation settings, part 2

In a recent article, I covered the parametric ICU collation customization settings. In today’s article, I will describe the other major collation customization option: the alternative collation types selected with the co key in the locale identifier.

The best way to find out which types are available and what they do is to look into the source code, which for ICU is the CLDR project. Let’s start with the definition of the keys in CLDR. (In this article, all the URLs are for a specific CLDR release, which is the latest one as I write this, because, as we will see below, these things change over time.)

<key name="co" description="Collation type key" alias="collation">
    <type name="big5han" description="Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)"/>
    <type name="compat" description="A previous version of the ordering, for compatibility" since="26"/>
    <type name="dict" description="Dictionary style ordering (such as in Sinhala)" alias="dictionary"/>
    <type name="direct" description="Binary code point order (used in Hindi)" deprecated="true"/>
    <type name="ducet" description="The default Unicode collation element table order" since="2.0.1"/>
    <type name="emoji" description="Recommended ordering for emoji characters" since="27"/>
    <type name="eor" description="European ordering rules" since="24"/>
    <type name="gb2312" description="Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese)" alias="gb2312han"/>
    <type name="phonebk" description="Phonebook style ordering (such as in German)" alias="phonebook"/>
    <type name="phonetic" description="Phonetic ordering (sorting based on pronunciation)"/>
    <type name="pinyin" description="Pinyin ordering for Latin and for CJK characters (used in Chinese)"/>
    <type name="reformed" description="Reformed ordering (such as in Swedish)"/>
    <type name="search" description="Special collation type for string search" since="1.9"/>
    <type name="searchjl" description="Special collation type for Korean initial consonant search" since="2.0.1"/>
    <type name="standard" description="Default ordering for each language"/>
    <type name="stroke" description="Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)"/>
    <type name="trad" description="Traditional style ordering (such as in Spanish)" alias="traditional"/>
    <type name="unihan" description="Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters (used in Chinese)"/>
    <type name="zhuyin" description="Pinyin ordering for Latin, zhuyin order for Bopomofo and CJK characters (used in Chinese)" since="22"/>
</key>

Also of interest is the section that defines this in the relevant standard UTS #35: http://unicode.org/reports/tr35/#UnicodeCollationIdentifier

Just as a reminder, to use these options in a locale string that you could pass to PostgreSQL (or another ICU application), the syntax would be something like de-u-co-phonebk (new) or de@collation=phonebook (old).

Now let’s work through what these options do. I give some examples below, but many of these apply to characters that I am not really familiar with, so I will refrain from giving examples there.

big5han

Description: “Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)”

This collation type appears only in the zh (Chinese) collation. Pinyin is a romanization system for Chinese characters. Big5 is a pre-Unicode character encoding used for Chinese. So this is a collation provided for compatibility with a particular pre-Unicode convention.

compat

Description: “A previous version of the ordering, for compatibility”

This collation type appears only in the ar (Arabic) collation.

This is self-explanatory: The standard collation was tweaked and the previous version is available as the “compat” type.

dict

Description: “Dictionary style ordering (such as in Sinhala)”

This collation type appears only in the si (Sinhala) collation. It makes some minor modifications compared to the standard collation of Sinhala.

direct

Description: “Binary code point order (used in Hindi)” (deprecated)

This collation type previously appeared in the hi (Hindi) collation, but it was removed in 2012.

(PostgreSQL note: To get binary code point order, you can use the C or ucs_basic collations.)

ducet

Description: “The default Unicode collation element table order”

This collation type does not currently exist in any collation in CLDR. DUCET is the actual Unicode standard while CLDR is the set of data files that implementatons such as ICU actually use. CLDR is slightly modified (“tailored”) from DUCET. The intention of the ducet collation type was to allow users to select an untailored pure standard collation. This existed briefly in CLDR in 2012 but was quickly found to not work fully and be hard to maintain.

emoji

Description: “Recommended ordering for emoji characters”

This collation type appears in the root collation.

There is a separate standard UTS #51 for sorting emoji characters, which this collation type implements. This would be suitable, for example, for presenting emoji characters in an input selector.

Example:

select chr(x) from generate_series(x'1F634'::int, x'1F644'::int) as _(x)
    order by chr(x);
 chr
-----
 😴
 😵
 😶
 😷
 😸
 😹
 😺
 😻
 😼
 😽
 😾
 😿
 🙀
 🙁
 🙂
 🙃
 🙄

create collation c1 (provider = icu, locale = 'und-u-co-emoji');
select chr(x) from generate_series(x'1F634'::int, x'1F644'::int) as _(x)
    order by chr(x) collate c1;
 chr
-----
 🙂
 🙃
 😶
 🙄
 😴
 😷
 😵
 🙁
 😺
 😸
 😹
 😻
 😼
 😽
 🙀
 😿
 😾

You can see here that the default order puts the emoji characters in a pretty arbitrary order. The emoji sort puts them in groups (smiley faces, cats) and also has a consistent order within the groups (happy, neutral, sad). See UTS #51 for details.

eor

Description: “European ordering rules”

This collation type appears in the root collation, and there is a “national delta” in the de (German) collation.

European ordering rules is a standard (EN 13710) for sorting Latin, Greek, Cyrillic, Georgian, and Armenian characters (that is, characters used by languages used in Europe). Which, you know, UCA does as well? Looking at the history of this, it appears this was worked on before the equivalent Unicode standards were fully accepted, and I suspect this is mostly obsolete now.

gb2312

Description: “Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese)”

This collation type appears only in the zh (Chinese) collation. Pinyin is a romanization system for Chinese characters. GB 2312 is a pre-Unicode character encoding used for Chinese (also known as EUC-CN). So this is a collation provided for compatibility with a particular pre-Unicode convention.

phonebk

Description: “Phonebook style ordering (such as in German)”

This collation type appears only in the de (German) and de_AT (German in Austria) collations.

The German standard DIN 5007 specifies the sorting of character strings. It contains two parts: One general part, and one for the sorting of lists of names. The first part is actually the same as the Unicode root collation, the second part is what is often called the phonebook sorting, but it really applies to any list of person names.

In CLDR, for the de collation, this collation type adds the following rules:

&AE<<ä<<<Ä
&OE<<ö<<<Ö
&UE<<ü<<<Ü

This says that the letter ä sorts after AE with a secondary difference, and Ä follows with a tertiary difference, and so on for the other vowels.

For de_AT, this collation type adds the following rules instead:

&a<ä<<<Ä
&o<ö<<<Ö
&u<ü<<<Ü
&ss<ß<<<ẞ

This says that the letter ä sorts as a separate letter (primary difference) after a, and so on.

Examples:

create collation german (provider = icu, locale = 'de');
create collation german_phonebook (provider = icu,
                                   locale = 'de-u-co-phonebk');
create collation austrian_phonebook (provider = icu,
                                     locale = 'de-AT-u-co-phonebk');

select *
    from (values ('Göbel'), ('Goethe'), ('Goldmann'), ('Göthe'), ('Götz')) _(x)
    order by x collate german;
     x
-----------
 Göbel
 Goethe
 Goldmann
 Göthe
 Götz

(This is the same as the root collation.)

select *
    from (values ('Göbel'), ('Goethe'), ('Goldmann'), ('Göthe'), ('Götz')) _(x)
    order by x collate german_phonebook;
     x
-----------
 Göbel
 Goethe
 Göthe
 Götz
 Goldmann

select *
    from (values ('Göbel'), ('Goethe'), ('Goldmann'), ('Göthe'), ('Götz')) _(x)
    order by x collate austrian_phonebook;
     x
-----------
 Goethe
 Goldmann
 Göbel
 Göthe
 Götz

See also https://de.wikipedia.org/wiki/Alphabetische_Sortierung#Deutsche_Sprache for more details on various variants of sorting in German.

phonetic

Description: “Phonetic ordering (sorting based on pronunciation)”

This collation type appears only in the ln (Lingala) collation. I don’t have any actual knowledge about this, but it seems pretty clear that this collation type modifies the order of some letters and letter groups to be more phonetic.

pinyin

Description: “Pinyin ordering for Latin and for CJK characters (used in Chinese)”

This collation type appears only in the zh (Chinese) collation. Pinyin is a romanization system for Chinese characters. So this collation basically sorts Chinese characters in accordance with their Latin-alphabet analogs.

reformed

Description: “Reformed ordering (such as in Swedish)”

This collation type currently does not appear in any collation. It previously appeared in the sv (Swedish) collation. In July 2022, the “reformed” collation for Swedish was renamed to “standard” and the previous standard one to “traditional”.

The only difference between the reformed (now standard) type and the traditional (previous standard) type is that the reformed type drops the rule

&v<<<V<<w<<<W

which means to sort w/W as second-level variants of v/V.

search

Description: “Special collation type for string search”

This collation type appears in the root collation and a number of language-specific collations. (But as far as I can tell, most if not all of those appearances in language-specific files just import the search rules from the root and then apply the language-specific standard rules. There don’t appear to be many language-specific search rules, but I haven’t checked this exhaustively.)

The description from UTS #35 is:

A special collation type dedicated for string search—it is not used to determine the relative order of two strings, but only to determine whether they should be considered equivalent for the specified strength, using the string search matching rules appropriate for the language. Compared to the normal collator for the language, this may add or remove primary equivalences, may make additional characters ignorable or change secondary equivalences, and may modify contractions to allow matching within them, depending on the desired behavior. For example, in Czech, the distinction between ‘a’ and ‘á’ is secondary for normal collation, but primary for search; a search for ‘a’ should never match ‘á’ and vice versa. A search collator is normally used with strength set to PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric” search as described in the [UCA] section Asymmetric Search). The search collator in root supplies matching rules that are appropriate for most languages (and which are different than the root collation behavior); language-specific search collators may be provided to override the matching rules for a given language as necessary.

Interestingly, the cited Czech example no longer appears in CLDR. It was removed in 2016. The rationale given was:

The change is partly because Czech users have become accustomed to the way Google search works in Czech; and since web search is the most common search behavior, it creates expectations for other search behavior.

So there you go.

Most of the search rules in the root collation concern non-Latin alphabets such as Arabic and Korean, about which I know very little, so I won’t try to dig into that.

One rule that I can decipher is

# root search rules for Symbols
&'='<'≠'

This means to make the not-equals sign sort after the equals sign with a primary difference.

The default collation weights for these characters are:

003D  ; [*06A2.0020.0002] # EQUALS SIGN
2260  ; [*06A2.0020.0002][.0000.002F.0002] # NOT EQUAL TO

This treats the not-equals sign sort of like an accent (secondary difference) on the equals sign. This probably makes sense for sorting, but it could behave strangely when searching.

If you make a collation with primary strength, your search would do this:

create collation c1 (provider = icu,
                     locale = 'und-u-co-standard-ks-level1',
                     deterministic = false);

select * from (values (u&'x\003Dy'), (u&'x\2260y')) _(x)
    where x collate c1 = 'x=y';
  x
-----
 x=y
 x≠y

which might be undesirable. If you use the search collation type, you will get a better behavior:

create collation c2 (provider = icu,
                     locale = 'und-u-co-search-ks-level1',
                     deterministic = false);

select * from (values (u&'x\003Dy'), (u&'x\2260y')) _(x)
    where x collate c2 = 'x=y';
  x
-----
 x=y

(PostgreSQL note: It is necessary to use the deterministic = false property for collations here.)

searchjl

Description: “Special collation type for Korean initial consonant search”

This collation type appears only in the ko (Korean) collation.

standard

Description: “Default ordering for each language”

This is the default. So specifying a collation as, for example, en-u-co-standard is equivalent to just writing en.

stroke

Description: “Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)”

This collation type appears only in the zh (Chinese) collation.

trad

Description: “Traditional style ordering (such as in Spanish)”

This collation type appears in the collations bn (Bengali), es (Spanish), fi (Finnish), kn (Kannada), sa (Sanskrit), and vi (Vietnamese).

For example, the “traditional” Spanish collation has the additional rules

&C<ch<<<Ch<<<CH
&l<ll<<<Ll<<<LL

which sorts ch and ll as separate letters (primary difference) in the alphabet. This rule has been removed by the Spanish language academy. But it remains available in CLDR/ICU for those who need it.

Similarly, in other languages, some traditional rules are dropped over time to make the language rules more suitable to the computer age (I suppose). See also under reformed for when the new rules are not yet ready to become the standard ones, and also compat for a related idea.

unihan

Description: “Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters (used in Chinese)”

This collation type appears in the ja (Japanese), ko (Korean), and zh (Chinese) collations.

Unihan refers to the Han unification of Japanese, Korean, and Chinese characters done in Unicode.

zhuyin

Description: “Pinyin ordering for Latin, zhuyin order for Bopomofo and CJK characters (used in Chinese)”

This collation type appears only in the zh (Chinese) collation. Zhuyin is a transliteration and writing system for Chinese.

Summary

As you can see from this discussion, there are a variety of reasons for having these alternative collation types. Some of them are for compatibility with previous practices or other standards, some of them were short-lived experiments, some of them are for the needs of a single language, a number of them are for the Chinese language in particular. (I found https://en.wikipedia.org/wiki/Written_Chinese#Dictionaries interesting on the latter.) Actually, all of these collation types are just a set of collation customization rules packaged up under a name. If your needs go beyond that, you can also write out the rules yourself, which PostgreSQL will support starting in version 16.