pyuca 1.2 Released with Support for New Versions of Unicode

pyuca is my pure-Python implementation of the Unicode Collation Algorithm, a library I use almost every day to sort Greek properly (although the library is not Greek-specific). I was recently asked how to use pyuca with a DUCET (Default Unicode Collation Element Table) more recent than 6.3.0. Answering that required a number of changes to the core code, which now supports 8.0.0, 9.0.0 and 10.0.0, as long as you have the right Python version.

pyuca has always supported custom collation element tables, but when someone tried the DUCET from Unicode 8.0.0, the test suite failed.

At first I thought that was because the test suite is from 6.3.0 (or 5.2.0 if running Python 2.7), but when I got around to running the 8.0.0 test suite against the 8.0.0 DUCET, it failed too.

It turned out that the Unicode Consortium had made a few changes to which code points are considered CJK Unified Ideographs. These ranges are hard-coded in pyuca because they’re required for implementing the implicit weight calculations (weights for certain CJK ideographs are calculated programmatically rather than explicitly listed in the DUCET).
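
To illustrate why a change in the ideograph ranges matters, here is a minimal sketch of the implicit weight calculation as described in UTS #10. It is not pyuca’s actual code: the function names and the range table are hypothetical, only the main CJK Unified Ideographs block is modelled (the extension blocks and compatibility ideographs are left out), and the block end points (U+9FCC for 6.3.0, U+9FD5 for 8.0.0) are my reading of the Unicode data files.

```python
# Hypothetical table: last code point of the main CJK Unified Ideographs
# block per Unicode version (assumption: taken from the Unicode data files,
# not from pyuca's source).
CJK_MAIN_BLOCK_END = {
    "6.3.0": 0x9FCC,
    "8.0.0": 0x9FD5,
}


def implicit_base(cp, unicode_version):
    """Pick the base for the first implicit weight (simplified)."""
    if 0x4E00 <= cp <= CJK_MAIN_BLOCK_END[unicode_version]:
        return 0xFB40  # core CJK Unified Ideographs
    return 0xFBC0  # everything else in this sketch, e.g. unassigned code points


def implicit_weights(cp, unicode_version):
    """Return the two primary weights (AAAA, BBBB) derived from a code point."""
    base = implicit_base(cp, unicode_version)
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb
```

U+9FD0 falls inside the main block only under the 8.0.0 range, so `implicit_weights(0x9FD0, "6.3.0")` gives `(0xFBC1, 0x9FD0)` while `implicit_weights(0x9FD0, "8.0.0")` gives `(0xFB41, 0x9FD0)`. A stale range table therefore sorts such characters differently from what the newer conformance tests expect.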

In 9.0.0, the collation element table format changed slightly to add a new @implicitweights directive, so I had to implement that for things to work with 9.0.0. Then in 10.0.0, more changes were made to which code points are considered CJK Unified Ideographs.
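
For a sense of what handling the directive involves, here is a small sketch of parsing one such line. The function name is hypothetical and this is not how pyuca necessarily does it; the line layout (`@implicitweights START..END; BASE # comment`) is my understanding of the 9.0.0 table format, with the Tangut entry as the example.

```python
def parse_implicitweights(line):
    """Parse an '@implicitweights START..END; BASE # comment' line.

    Returns (start, end, base) as integers, or None for any other line.
    """
    prefix = "@implicitweights"
    if not line.startswith(prefix):
        return None
    body = line[len(prefix):].split("#", 1)[0]  # drop the trailing comment
    range_part, base_part = body.split(";")
    start, end = range_part.strip().split("..")
    return int(start, 16), int(end, 16), int(base_part.strip(), 16)


# Example line as I believe it appears in the 9.0.0 table (assumption).
example = "@implicitweights 17000..18AFF; FB00 # Tangut and Tangut Components"
start, end, base = parse_implicitweights(example)
```

A collator can then keep the parsed ranges in a list and consult them before falling back to the hard-coded ideograph ranges.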

It didn’t stop there, though. Because pyuca relies on Python’s unicodedata module for information on character categories, and each Python release bundles the data for one particular Unicode version, certain versions of Python won’t work with certain versions of Unicode.

So I added some logic (both to pyuca itself, and to the test suite) to use the appropriate collation code (with the right implicit weight calculations) and appropriate DUCET depending on what version of Python you are running.

Some of this dispatching-based-on-Python-version had already been written by Chris Beaven, Paul McLanahan, and Michal Čihař as part of their backporting of pyuca to 2.7 (after I’d declared I’d only support 3). So I just extended this with the following results:

  • Python 2.7: test and use 5.2.0
  • Python 3.3: test 5.2.0, 6.3.0 and use 6.3.0 by default
  • Python 3.4: test 5.2.0, 6.3.0 and use 6.3.0 by default
  • Python 3.5: test 5.2.0, 6.3.0, 8.0.0 and use 8.0.0 by default
  • Python 3.6: test 5.2.0, 6.3.0, 8.0.0, 9.0.0 and use 9.0.0 by default
  • Python 3.7-dev: test 5.2.0, 6.3.0, 8.0.0, 9.0.0, 10.0.0 (so we’re ready)
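
The defaults in the list above can be sketched as a simple dispatch on `sys.version_info`. This is only an illustration with a hypothetical function name, not pyuca’s actual code; versions the list doesn’t mention (3.0 to 3.2, and the eventual default for 3.7) are handled by assumption.

```python
import sys


def default_ducet_version(version_info=sys.version_info):
    """Map a Python version to the default DUCET version from the list."""
    major_minor = tuple(version_info[:2])
    if major_minor < (3, 0):
        return "5.2.0"  # Python 2.7
    if major_minor < (3, 5):
        return "6.3.0"  # Python 3.3 and 3.4 (3.0 to 3.2 assumed to match)
    if major_minor == (3, 5):
        return "8.0.0"
    # 3.6 and later; no default is stated for 3.7, so this sketch
    # assumes it stays at 9.0.0.
    return "9.0.0"


print(default_ducet_version())
```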

pyuca 1.2 has now been released and is available on PyPI. The repository is at https://github.com/jtauber/pyuca.

