Python Unicode Collation Algorithm

27th January 2006 / by James Tauber

My preliminary attempt at a Python implementation of the Unicode Collation Algorithm (UCA) is done and available at:

http://jtauber.com/2006/01/27/pyuca.py (old version—see UPDATE below)

This implements only the simpler parts of the algorithm, but I have successfully tested it with the Default Unicode Collation Element Table (DUCET) to collate Ancient Greek correctly.

The core of the algorithm, which is what I have implemented, is multi-level comparison. For example, café comes before caff because the accent is ignored at the primary level, where the first word is treated as if it were cafe. The secondary level (which considers accents) applies only to words that are equivalent at the primary level.
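To make the multi-level idea concrete, here is a minimal sketch. It is not the pyuca code and does not use DUCET weights: the function name toy_sort_key is made up for illustration, and it assumes that stripping combining marks is a good enough stand-in for the primary level. The key is a pair (primary, secondary), so accents only matter when the base letters are identical.

```python
import unicodedata


def toy_sort_key(word):
    """Two-level toy key: base letters first, accents second.

    Illustrative only. Real UCA keys come from a collation element
    table such as DUCET, not from Unicode decomposition.
    """
    primary = []    # base characters, compared first
    secondary = []  # combining marks attached to each base character
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and primary:
            # An accent: record it against the preceding base character.
            secondary[-1] += ch
        else:
            primary.append(ch)
            secondary.append("")
    return (tuple(primary), tuple(secondary))


words = ["caff", "café", "cafe"]
print(sorted(words))                    # ['cafe', 'caff', 'café'] (code point order)
print(sorted(words, key=toy_sort_key))  # ['cafe', 'café', 'caff']
```

Because Python compares tuples element by element, the secondary tuple is consulted only when two primary tuples are equal, which is exactly the behaviour described above.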

The UCA (and my code) also supports contraction and expansion. Contraction is where multiple letters are treated as a single unit: in traditional Spanish ordering, ch is treated as a letter coming between c and d so that, for example, words beginning with ch sort after all other words beginning with c. Expansion is where a single letter is treated as though it were multiple letters: in German phone-book ordering, ä is sorted as if it were ae, i.e. after ad but before af.
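Contraction and expansion can be sketched in a few lines as well. Again this is only a toy under stated assumptions, not how pyuca or the UCA represents them: build_key, spanish_key and german_key are hypothetical names, the "alphabet" is just lowercase ASCII, and only a single comparison level is modelled. Contraction is handled by longest-match tokenising, expansion by substituting characters before tokenising.

```python
import string
import unicodedata


def build_key(units, expansions=None):
    """Return a sort-key function for a toy single-level collation.

    units      -- collation units in sorted order; a multi-letter unit
                  such as "ch" acts as a contraction
    expansions -- optional dict mapping one character to the string it
                  should be sorted as, e.g. {"ä": "ae"}
    """
    expansions = expansions or {}
    weights = {unit: index for index, unit in enumerate(units)}
    longest = max(len(unit) for unit in units)

    def key(word):
        word = unicodedata.normalize("NFC", word)
        expanded = "".join(expansions.get(ch, ch) for ch in word)
        result = []
        i = 0
        while i < len(expanded):
            # Try the longest candidate first so "ch" wins over "c".
            for size in range(longest, 0, -1):
                chunk = expanded[i:i + size]
                if chunk in weights:
                    result.append(weights[chunk])
                    i += len(chunk)
                    break
            else:
                raise ValueError("no weight for %r" % expanded[i])
        return tuple(result)

    return key


letters = list(string.ascii_lowercase)

# Contraction: "ch" is its own unit between "c" and "d".
spanish_key = build_key(letters[:3] + ["ch"] + letters[3:])
print(sorted(["chico", "cuna", "dama", "casa"], key=spanish_key))
# ['casa', 'cuna', 'chico', 'dama']

# Expansion: "ä" is sorted as if it were "ae".
german_key = build_key(letters, expansions={"ä": "ae"})
print(sorted(["hafen", "händler", "hader"], key=german_key))
# ['hader', 'händler', 'hafen']
```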

Here is how to use the pyuca module:

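from pyuca import Collator

# Load the collation element table from a local copy of allkeys.txt.
c = Collator("allkeys.txt")

# words is any list of Unicode strings to be sorted.
sorted_words = sorted(words, key=c.sort_key)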

allkeys.txt (1 MB) is available at

http://www.unicode.org/Public/UCA/latest/allkeys.txt

but you can always subset this file to just the characters you are dealing with (and you will need to do so if any language-specific tailoring is needed).
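One possible way to do the subsetting is sketched below. It assumes the usual allkeys.txt layout, in which each entry starts with one or more hexadecimal code points followed by a semicolon, and lines beginning with # or @ are comments or directives. The function name subset_allkeys is hypothetical, and the sample weights are invented for the example rather than copied from any DUCET release.

```python
def subset_allkeys(lines, wanted_chars):
    """Yield only the table lines whose code points are all wanted.

    Comment lines, directives and blank lines are passed through
    unchanged. An entry keyed on several code points (a contraction)
    is kept only if every one of them is in wanted_chars.
    """
    wanted = {ord(ch) for ch in wanted_chars}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "@")) or ";" not in stripped:
            yield line
            continue
        code_points = stripped.split(";", 1)[0].split()
        if all(int(cp, 16) in wanted for cp in code_points):
            yield line


# Made-up sample in the allkeys.txt layout (weights are placeholders).
sample = [
    "@version 0.0.0",
    "# illustrative entries only",
    "0061 ; [.0001.0020.0002] # LATIN SMALL LETTER A",
    "0062 ; [.0002.0020.0002] # LATIN SMALL LETTER B",
    "0063 ; [.0003.0020.0002] # LATIN SMALL LETTER C",
    "0063 0068 ; [.0004.0020.0002] # hypothetical contraction entry",
]

for line in subset_allkeys(sample, "ab"):
    print(line)
```

To subset a real file, pass an open file object as lines and write the yielded lines to a new file, which can then be given to Collator in place of the full table.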

UPDATE (2006-02-13): Now see bug fix

UPDATE (2012-06-21): Now see https://github.com/jtauber/pyuca

originally published on jtauber.com



J. K. Tauber

at the intersection of computing, linguistics, philology, and learning science


By day I’m an entrepreneur, web technologist and open-source developer but my academic background is in linguistics (along with some classics, comparative philology, and educational statistics) and my main avocation is working on text, annotations, analysis and software relating to historical languages with a particular interest in facilitating better learning.

While my focus has mostly been on Biblical Greek, much of the work is highly relevant to other Hellenistic Greek texts, other dialects of Ancient Greek and, indeed, texts in completely different languages as well.

All code written for this endeavour is open source and text and data is made available under a Creative Commons license to the extent allowed by the sources used.

I can be contacted at jtauber@jtauber.com.
