Unicode Concerns

This page highlights some of the issues surrounding Unicode use in Kodi, beyond simply enabling basic Unicode support in your add-on or the Kodi runtime. The good news is that, for the bulk of your work, it is fairly simple to read and display translated messages; all the basic plumbing is in place.

List of concerns

Here are a few specific concerns:

The side effects of Issue #19883^[1]
Handling text from multiple locales (e.g. complications with foreign movies or actors)
How to handle internal programming keywords along with other text
Quirks of some languages
Searching multilingual text
Unicode file names
Unicode in XML
Unicode file names in a zip file
JSON or XML processing of Unicode with keywords
Lower/Upper case does not take you very far
String length ≠ number of Unicode characters
The same "glyph" (either lexically speaking or visually, and oftentimes both) can, in practice, be represented by more than one unique sequence of Unicode characters
Minimal support in Python and C++
Other Unicode libraries exist, but they are non-trivial to adapt our codebase to
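
Two of the items above (string length versus character count, and one glyph with several representations) can be illustrated with Python's standard unicodedata module. This is only a small sketch; the helper name same_text is made up for illustration, and NFC is just one of the normalization forms that could be chosen.

```python
import unicodedata

composed = "\u00e9"      # 'é' stored as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent


def same_text(a, b):
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both strings render as the same glyph, yet they differ in code points
# and in the number of bytes needed to store them as UTF-8.
code_points = (len(composed), len(decomposed))
utf8_bytes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

A plain == comparison treats the two strings as different; only the normalized comparison treats them as the same text.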

A good source of background information on the matter is the implementation guidelines as described by the Unicode Consortium in the latest Unicode Core Specification (v14.0 as of this writing).^[2] Its authors provide a good (but not at all trivial to learn) and free reference implementation library called ICU (International Components for Unicode)^[3] that performs advanced Unicode processing.

Details

Side effects of Kodi issue 19883

Following the API version bumps that accompanied the release of Kodi v19 "Matrix", the default interface became unusable when the tr_TR (Turkish) locale was activated. A quick fix was made, but it has at least two undesirable side effects on Python add-ons:

the default locale cannot be read, making it impossible to determine the country code ('US'), and
the encoding for filenames is ASCII instead of UTF-8.
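
For the second side effect, one workaround in Python is to pass the path to the file functions as UTF-8 encoded bytes instead of as str. The sketch below assumes Python 3; open_encoded and roundtrip_demo are hypothetical names used only for illustration, and the file name is an arbitrary example.

```python
import io
import os
import tempfile


def open_encoded(path, mode="r"):
    """Hypothetical helper: open a text file via a UTF-8 encoded bytes path."""
    # On POSIX systems a bytes path is handed to the OS as-is, so the
    # implicit str -> filesystem-encoding conversion is skipped.
    return io.open(path.encode("utf-8"), mode, encoding="utf-8")


def roundtrip_demo():
    """Write and read back a file whose name contains non-ASCII characters."""
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "T\u00fcrk\u00e7e.txt")
    with open_encoded(path, "w") as handle:
        handle.write("merhaba")
    with open_encoded(path) as handle:
        text = handle.read()
    # Clean up using the same encoded form of the path.
    os.remove(path.encode("utf-8"))
    os.rmdir(folder)
    return text
```

Keeping the encoding step in a single helper makes it easier to remove again should the underlying behavior change.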

Trying to open a file with non-ASCII characters in its name or path raises exceptions; you have to explicitly encode the filename (e.g. io.open(self.path.encode('utf-8'))).

Upon closer inspection, it became apparent that the Kodi C++ codebase has multiple problems. As is common in many programs, keywords are caseless and the code normalizes all keywords to lower case. This approach is flawed in several ways:

ToLower works with most, but not all, locales and character sets. It is safe where the round trip ToUpper(ToLower(char)) == char holds, which is the case in almost all languages. However, Turkish (and some others) have several characters that do not obey this rule. Worse, some of these characters are shared with English: the lowercase Latin letter 'i' is one of them!
Kodi uses the same locale for processing external text and internal keywords alike. Under the Turkish locale, this causes any keyword containing a capital Latin letter 'I' to go unrecognized, because ToLower changes the 'I' to the Turkish dotless 'ı' (U+0131) rather than to 'i'.
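
The keyword failure described above can be imitated in Python. Python's own str.lower() is locale-independent, so the Turkish behavior is simulated here with a hypothetical turkish_lower helper; the keyword set is likewise invented for illustration and is not taken from Kodi's code.

```python
def turkish_lower(text):
    """Hypothetical helper: lowercase using the Turkish dotted/dotless i rules."""
    # Turkish maps 'İ' (U+0130) to 'i' and 'I' to 'ı' (U+0131).
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


KEYWORDS = {"title", "visible", "onclick"}  # hypothetical internal keywords


def is_keyword(name, lower):
    """Look up a name after lowercasing it with the given function."""
    return lower(name) in KEYWORDS


found_default = is_keyword("TITLE", str.lower)      # lowercases to 'title'
found_turkish = is_keyword("TITLE", turkish_lower)  # lowercases to 'tıtle'

# Round trips are not guaranteed even without any locale involved:
# 'İ' lowercases to 'i' plus a combining dot, which uppercases to 'I' plus dot.
round_trip = "\u0130".lower().upper()
```

The lookup succeeds with the locale-independent lowercase and fails with the simulated Turkish one, which mirrors the skin-loading breakage described in the issue.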

In addition, ToLower and ToUpper modify the passed string in place. However, Unicode gives no guarantee that the number of bytes or Unicode characters coming out will be the same as the number that went in. This means that junk can be left at the end of the string array, or memory could be clobbered (depending on whether the output has fewer or more bytes than anticipated).

The proper solution is to create methods which handle the different ways that characters are used. Instead of using in-line calls to ToLower, et cetera, you need methods which:

Create a setting keyword from a string
- Frequently, keywords are looked up in a case-insensitive manner, but with the possibility of text being in any language, things quickly get more complicated
- One approach is to restrict the keyword characters in some way and then to use a specific locale just when handling said keywords, since you run into trouble if you process some English characters using the Turkish locale (the infamous 'i' problem)
- Another approach is to "case fold" the characters. Put simply, each character is mapped to a common caseless form (usually its lowercase equivalent) using rules that do not depend on the locale; stripping diacritics (accents, etc.) is a separate, optional step. The result is not meant for human consumption, but generally works well for matching keywords and it has the advantage of being locale-agnostic.
Create a filename from a string
- Some operating systems (looking at you, Windows…) internally identify files with a case-insensitive filename. Generally, you want to compare filenames in a case-insensitive manner (whether this should also be done in a folded manner is an open question).
Compare strings to see if one matches any entry on a list of keywords, though different methods may be needed for different keyword types.
Search and sort methods purposely customized to handle the volume of large searches
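
The two keyword approaches listed above can be sketched in Python as follows. All function names are hypothetical. make_keyword combines str.casefold() with NFD normalization, roughly in the spirit of the caseless matching described in the Unicode Standard, while make_ascii_keyword takes the restrictive route and rejects anything outside ASCII.

```python
import unicodedata


def make_keyword(text):
    """Hypothetical helper: locale-independent key for caseless matching."""
    decomposed = unicodedata.normalize("NFD", text)
    # Case folding can itself produce unnormalized text, so normalize again.
    return unicodedata.normalize("NFD", decomposed.casefold())


def make_ascii_keyword(text):
    """Hypothetical helper: allow only ASCII keywords, then lowercase them."""
    text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    return text.lower()   # str.lower() does not consult the process locale


def matches_any(text, keywords):
    """Return True if text matches any keyword, ignoring case."""
    key = make_keyword(text)
    return any(key == make_keyword(keyword) for keyword in keywords)
```

Centralizing the transformation in one helper, instead of scattering lowercase calls, means the matching rules can be changed in a single place.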

In C++, you have to worry about allocating and freeing memory for toLower, normalized, caseless or other forms of the original strings. More precisely, they are transformations on the strings which can change the number of bytes required to store them, requiring a copy. Your method signatures frequently need to pass a pointer to this temporary string around until it is finally freed.
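
The size changes mentioned above are easy to observe from Python, whose str.lower() and str.upper() return new strings. The helper name size_of is made up for this sketch; the sample characters are simply ones whose case mappings are known to change length.

```python
def size_of(text):
    """Hypothetical helper: return (code points, UTF-8 bytes) for a string."""
    return (len(text), len(text.encode("utf-8")))


dotted_capital_i = "\u0130"  # 'İ'
sharp_s = "\u00df"           # 'ß'
fi_ligature = "\ufb01"       # 'ﬁ'

before_and_after = {
    "lower(İ)": (size_of(dotted_capital_i), size_of(dotted_capital_i.lower())),
    "upper(ß)": (size_of(sharp_s), size_of(sharp_s.upper())),
    "upper(ﬁ)": (size_of(fi_ligature), size_of(fi_ligature.upper())),
}
```

In this sample the output can be longer or shorter than the input in bytes, which is why an in-place transformation on a fixed-size buffer is unsafe.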

Name declaration in Settings.xml

Although not widely documented, declared names in Settings.xml are considered caseless and are all mapped to lowercase, just like the keywords described above, so the same problems exist here too. Add-on developers may well be unaware of this. It means that, in addition to the Kodi back-end treating declared names as keywords, Python add-ons must do the same thing when comparing or manipulating these names.

Searching Text

Kodi provides the ability to search for a movie title within the current playlist by simply typing the first few characters (in uppercase) of the title to search for. This won't work properly for several reasons:

the above ToLower/ToUpper problem, and
character collation rules vary for different locales. If the user is searching for a foreign language movie title, it may not be processed/sorted as expected, requiring that we converge on some common, deterministic methods of searching and sorting.

There are several approaches to solving the second problem. One is to do a case-less, accent-less comparison; it looks like SQLite and MySQL both support these. Extra columns or tables (for case-less, accent-less copies of the search data) may be needed to improve performance. Some discussion on this very topic can be found in the same Chapter 5 of the Unicode Specification referenced above.^[2]
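
The extra-column idea can be sketched with Python's built-in sqlite3 module. The table layout, column names and helper functions below are assumptions made for illustration and do not reflect Kodi's actual database schema; the key is computed in Python, not by a database collation.

```python
import sqlite3
import unicodedata


def search_key(text):
    """Hypothetical helper: build a caseless, accent-less key for matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def build_index(titles):
    """Store each title next to its precomputed search key (the extra column)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE movies (title TEXT, title_key TEXT)")
    db.executemany(
        "INSERT INTO movies VALUES (?, ?)",
        [(title, search_key(title)) for title in titles],
    )
    return db


def find_by_prefix(db, typed):
    """Return titles whose key starts with the key of the typed characters."""
    # Note: '%' and '_' in the typed text are not escaped in this sketch.
    rows = db.execute(
        "SELECT title FROM movies WHERE title_key LIKE ? ORDER BY title_key",
        (search_key(typed) + "%",),
    )
    return [row[0] for row in rows]


db = build_index(["Am\u00e9lie", "\u00c4mter", "El Ni\u00f1o", "Z\u00fcrich"])
```

Because the typed text and the stored titles go through the same search_key function, matching and ordering behave the same way regardless of the user's locale, at the cost of ignoring locale-specific collation rules.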

References

[1] Gello, Vasyl (June 14, 2021). Kodi Issue #19883: Turkish locale set via LANG / LC_CTYPE / LC_ALL or by SetGlobalLocale breaks skin loading on any Linux at the Kodi Issue Tracker on GitHub.
[2] Chapter 5: Implementation Guidelines in The Unicode® Standard, Version 14.0 – Core Specification (September 14, 2021).
[3] ICU - International Components for Unicode, produced by the ICU Technical Committee (ICU-TC) of the Unicode Consortium (April 7, 2022).
