Unicode Concerns: Difference between revisions

From Official Kodi Wiki



Revision as of 21:06, 19 March 2022

This page highlights some of the Unicode issues that go beyond simply enabling basic Unicode support in your addon or the Kodi runtime. Here are a few:

  • The good news is that for the bulk of your work, it is fairly simple to read and display translated messages. The basic plumbing is in place.
  • Side effects of Kodi issue 19883: Turkish locale set via LANG / LC_CTYPE / LC_ALL or by SetGlobalLocale breaks skin loading on any Linux. (github.com/xbmc/xbmc/issues/19883)
  • Handling text from multiple locales (e.g. complications with foreign movies or actors)
  • How to handle internal programming keywords along with other text
  • Quirks of some languages
  • Searching multilingual text
  • Unicode file names
  • Unicode in XML
  • Unicode file names in a zip file
  • JSON or XML processing of Unicode with keywords
  • Lower/Upper case does not take you very far
  • String length != number of Unicode 'letters'
  • The same character (both visually and logically) can sometimes be represented by more than one sequence of Unicode 'letters'
  • Minimal support in Python and C++
  • Other Unicode libs exist, but non-trivial to use

A good source of information can be found at: Unicode.org. This same organization provides a good (but not the easiest to learn) and free library that performs advanced Unicode processing.
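
Two of the points in the list above (string length versus the number of 'letters', and one character having more than one representation) can be seen with Python's standard unicodedata module. This is only a small sketch; the sample strings are chosen for illustration.

```python
import unicodedata

# The same visible character 'é' in two different representations.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: 'e' + COMBINING ACUTE ACCENT

# They look identical, but compare unequal and have different lengths.
same_without_normalizing = composed == decomposed
lengths = (len(composed), len(decomposed))
utf8_sizes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Normalizing both sides to the same form (NFC here) makes them comparable.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))

print(same_without_normalizing, lengths, utf8_sizes, same_after_nfc)
```

Neither the code point count nor the byte count matches what a user would call "one letter" in the decomposed case, which is why comparisons usually need a normalization step first.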

Details:

Side effects of Kodi issue 19883

Kodi was unusable with Locale = tr_TR (Turkish). A quick fix was made, but it has some undesirable side-effects on the Python addons:

  • The default locale cannot be read, making it impossible to determine the country code ('US')
  • The encoding for filenames is ASCII instead of UTF-8. Trying to open a file with a non-ASCII name throws an exception. You have to explicitly encode the filename (e.g. io.open(self.path.encode('utf-8'))).
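
The filename workaround above could be wrapped in a small helper so that the explicit encoding happens in one place. The sketch below is an assumption about how an addon might do this; open_with_utf8_name is a hypothetical name, not part of Kodi or Python, and it assumes the filesystem stores names as UTF-8 (typical on Linux).

```python
import io
import os
import tempfile


def open_with_utf8_name(path, mode="r", **kwargs):
    """Open a file whose name may contain non-ASCII characters.

    Hypothetical helper: the path is encoded to UTF-8 bytes by hand, so the
    call does not depend on the filesystem encoding Python detected.
    """
    if isinstance(path, str):
        path = path.encode("utf-8")
    return io.open(path, mode, **kwargs)


# Usage: write and read back a file with a non-ASCII name.
folder = tempfile.mkdtemp()
name = os.path.join(folder, "caf\u00e9.txt")
with open_with_utf8_name(name, "w", encoding="utf-8") as handle:
    handle.write("ok")
with open_with_utf8_name(name, "r", encoding="utf-8") as handle:
    content = handle.read()
print(content)
```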

On closer inspection, the Kodi C++ application has a deeper issue. As is common in many programs, keywords are caseless, and the code normalizes all keywords to lower case. There are several problems with this:

  • ToLower works for most, but not all, locales and character sets. It relies on case conversion round-tripping, i.e. ToUpper(ToLower(char)) == char for any upper-case character, which holds in almost all languages. However, Turkish (and some other languages) have several characters that do not obey this rule, and some of these characters are shared with English. The letter 'i' is one of them.
  • Kodi uses the same locale for processing external text as for internal keywords. Under the Turkish locale, this caused any keyword containing an 'I' to go unrecognized, because ToLower changed the 'I' to the Turkish dotless lower-case 'ı' (U+0131).
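
The keyword failure described above can be imitated in Python. Python's own str.lower() ignores the locale, so the sketch uses a hypothetical turkish_lower function that applies the two Turkish special cases by hand; the keyword set is made up for illustration and is not taken from the Kodi source.

```python
def turkish_lower(text):
    """Hypothetical stand-in for ToLower under a Turkish locale.

    Turkish pairs 'I' with dotless 'ı' (U+0131) and dotted 'İ' (U+0130)
    with 'i'. The two special cases are applied first, then the rest of
    the string is lowered with Python's locale-independent rules.
    """
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


KEYWORDS = {"visible", "include", "id"}  # sample keywords, for illustration


def is_keyword_naive(text):
    # Mimics the failure mode: keywords lowered with the user's locale.
    return turkish_lower(text) in KEYWORDS


def is_keyword_fixed(text):
    # Keywords lowered with locale-independent rules instead.
    return text.lower() in KEYWORDS


print(turkish_lower("VISIBLE"))     # 'I' became dotless 'ı'
print(is_keyword_naive("VISIBLE"))  # False
print(is_keyword_fixed("VISIBLE"))  # True
```

The round-trip rule also breaks without any locale involved: in Python, "\u0131".upper().lower() gives a plain "i", not the dotless 'ı' it started from.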

In addition, ToLower and ToUpper modify the passed string in place. However, Unicode gives no guarantee that the number of bytes, or the number of Unicode characters, stays the same after a case conversion. This means that junk can be left at the end of the string array, or memory past the end of the array could be clobbered.
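
A few concrete cases show how the size can change. Python strings are immutable, so upper() and lower() always return a new string, which sidesteps the in-place problem; the sketch below only measures the sizes before and after. size_report is a hypothetical helper name.

```python
def size_report(text, converted):
    """Return (code points, UTF-8 bytes) before and after a conversion."""
    before = (len(text), len(text.encode("utf-8")))
    after = (len(converted), len(converted.encode("utf-8")))
    return before, after


# German sharp s: one code point becomes two ('SS').
sharp_s = size_report("\u00df", "\u00df".upper())

# Dotted capital I: lowers to 'i' plus a combining dot, growing by a byte.
dotted_i = size_report("\u0130", "\u0130".lower())

# Dotless small i: uppers to ASCII 'I', shrinking from two bytes to one.
dotless_i = size_report("\u0131", "\u0131".upper())

print(sharp_s, dotted_i, dotless_i)
```

A buffer sized for the original string can therefore be too small or too large for the converted one, in either direction.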

The proper solution is to create methods which handle the different ways that characters are used. Instead of using in-line calls to ToLower, etc., you need methods which:

  • Create a setting keyword from a string
    • Frequently keywords are looked up in a 'caseless' manner. But with the possibility of text being in any language, things get more complicated.
    • One approach is to restrict the keyword characters in some way and then use a specific locale just when handling the keywords. You run into trouble if you process some English characters using the Turkish locale (the infamous 'i' problem).
    • Another approach is to "case fold" the characters. Case folding removes case distinctions in a locale-agnostic way. Stripping accents is a separate step (decompose the string, then drop the combining marks) that is often applied alongside it. The result is not meant for human consumption but generally works well for matching keywords.
  • Create a filename from a string
    • Some operating systems (Windows) internally identify files with a caseless filename. Generally you want to compare filenames in a caseless manner (not sure if it should be done in a folded manner or not).
  • Compare a string against a list of keywords to see if it matches one of them. You may need different methods for different keyword types
  • Search and sort, customized for the purpose and volume of searching
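
As a rough sketch of the "case fold" approach from the list above, the following uses only the Python standard library. fold_keyword and matches_keyword are hypothetical names, and the accent stripping is an extra step layered on top of str.casefold(), which by itself only removes case distinctions.

```python
import unicodedata


def fold_keyword(text):
    """Return a caseless, accentless form of text for matching only."""
    folded = text.casefold()                     # locale-independent caseless form
    decomposed = unicodedata.normalize("NFKD", folded)
    # Drop combining marks (accents) left behind by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def matches_keyword(text, keywords):
    """True if text matches any keyword after folding both sides."""
    wanted = fold_keyword(text)
    return any(wanted == fold_keyword(keyword) for keyword in keywords)


print(fold_keyword("Caf\u00e9"))                          # cafe
print(fold_keyword("Stra\u00dfe"))                        # strasse
print(matches_keyword("VISIBLE", ["visible", "enable"]))  # True
```

Keeping the folding in one function means every keyword lookup uses the same rules, instead of scattering lower() calls through the code.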

In C++ you have to worry about allocating and freeing memory for toLower, normalized, caseless or other forms of the original strings. More correctly, they are transformations on the string which can change the number of bytes that the strings take up, requiring a copy. Your method signatures frequently need to pass a pointer to this temporary string around until it is finally freed.

Setting names in Settings.xml

Although not widely documented, setting names are considered caseless and are all mapped to lower case, just as the keywords above are. The same problems exist. Addon programmers may be unaware of this. This means that, in addition to the Kodi back-end treating setting names as keywords, Python addons must do the same when comparing or manipulating these setting names.
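
One way an addon might treat setting names consistently is to lower only the ASCII letters, which cannot be affected by the locale. This sketch assumes, without confirmation from the Kodi source, that setting names are limited to ASCII; setting_key, get_setting and the sample setting name are hypothetical.

```python
import string

# Translation table that lowers A-Z only and leaves everything else alone.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def setting_key(name):
    """Map a setting name to its caseless lookup key (ASCII letters only)."""
    return name.translate(_ASCII_LOWER)


# Store and look up settings through the same key function.
settings = {setting_key("Subtitle.Language"): "English"}


def get_setting(name):
    return settings.get(setting_key(name))


print(get_setting("SUBTITLE.LANGUAGE"))  # English
```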

Searching Text

Kodi provides the ability to search for a movie title within the current playlist by simply (and quickly) typing the first few characters of the title [in upper case]. This won't work properly for several reasons:

  1. The above ToLower/ToUpper problem
  2. Character collation rules vary for different locales. If the user is searching for a foreign language movie title, it may not be processed/sorted as expected. Some common way of searching/sorting needs to be converged on.

There are several approaches to solving the second problem. One is to do a caseless, accentless comparison. It looks like SQLite and MySQL both support these. Extra columns or tables (holding caseless, accentless copies of the search data) may be needed to improve performance. For some discussion on the topic, see: https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf
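
The "extra column" idea could look roughly like this with Python's built-in sqlite3 module. The table layout, the fold function and the sample titles are assumptions made for the sketch and do not reflect Kodi's actual database schema.

```python
import sqlite3
import unicodedata


def fold(text):
    """Caseless, accentless copy of text, used only for searching/sorting."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


conn = sqlite3.connect(":memory:")
# Hypothetical table: the display title plus a folded copy for lookups.
conn.execute("CREATE TABLE movie (title TEXT, title_folded TEXT)")
for title in ["Am\u00e9lie", "AMERICAN BEAUTY", "\u00c9loge de l'amour", "Brazil"]:
    conn.execute("INSERT INTO movie VALUES (?, ?)", (title, fold(title)))
conn.execute("CREATE INDEX idx_movie_folded ON movie (title_folded)")


def search_prefix(connection, typed):
    """Return titles whose folded form starts with the folded input.

    Note: '%' and '_' in the typed text are not escaped in this sketch.
    """
    rows = connection.execute(
        "SELECT title FROM movie WHERE title_folded LIKE ? "
        "ORDER BY title_folded",
        (fold(typed) + "%",),
    )
    return [row[0] for row in rows]


print(search_prefix(conn, "AME"))
print(search_prefix(conn, "el"))
```

Folding once at insert time keeps the per-keystroke query simple, at the cost of storing each title twice.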