Add-on unicode paths: Difference between revisions

From Official Kodi Wiki
Latest revision as of 22:04, 3 December 2019

THIS PAGE IS OUTDATED:

This page or section has not been updated in a long time but many parts still apply up to v18. v19 introduces Python 3 and this page no longer applies to v19 and later. This page may be archived at a later date.

This page describes how to prevent common problems with non-Latin characters in Kodi or add-on paths.

Unicode paths

If you want to write an add-on that can work with paths like 'd:\apps\éîäß\' or 'opt/Kodi/àí', you should first read http://docs.python.org/2/howto/unicode.html#python-2-x-s-unicode-support

After reading, you should know: "Software (Python) should only work with unicode strings internally, converting to a particular encoding on output. (or input)". To make string literals unicode by default, add

 from __future__ import unicode_literals

at the top of the module. See https://docs.python.org/2/reference/simple_stmts.html#future for details.

Kodi outputs UTF-8 encoded strings. Input can be unicode or UTF-8 encoded, but some functions reportedly do not work with unicode input parameters.

Therefore the simplest way to deal with non-ASCII characters is to pass every parameter to Kodi as a UTF-8 encoded string and to convert Kodi's UTF-8 output back to unicode.
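The round trip described above can be sketched as two small helpers. The names to_kodi and from_kodi are hypothetical and not part of any Kodi API; the sketch only assumes that values crossing the Kodi boundary are UTF-8 byte strings.

```python
# -*- coding: utf-8 -*-

def to_kodi(text):
    """Encode a unicode object to UTF-8 before passing it to Kodi."""
    return text.encode('utf-8')


def from_kodi(data):
    """Decode a UTF-8 byte string returned by Kodi to unicode."""
    return data.decode('utf-8')


title = u'd:\\apps\\éîäß\\'
encoded = to_kodi(title)        # UTF-8 byte string, safe to hand to Kodi
decoded = from_kodi(encoded)    # unicode again, for use inside the add-on
```

Inside the add-on everything stays unicode; encoding and decoding happen only where a value is handed to or received from Kodi.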

File functions in Python

Windows

Windows' NTFS is Unicode-aware, but Windows still uses code pages such as cp850 for Western Europe.

If you call Python file functions with byte string parameters, the strings are converted internally to the Windows code page, which means that you cannot access a file with Greek characters in its name on an English Windows installation. If you pass unicode objects to the file functions, everything works as expected!

Linux

When the locale is set to C or POSIX, Python assumes that the file system is ASCII-only and tries to encode all unicode inputs to ASCII. In reality the file system does not have a specific encoding, and UTF-8 is a much better guess. Because of this, you must not pass unicode to Python file functions!

Instead, always use UTF-8 encoded strings.
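The difference between the two encodings can be reproduced without touching the file system. This is only an illustrative sketch: the path is one of the example paths from this page, and the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
path = u'opt/Kodi/àí'

# What Python attempts under a C/POSIX locale: ASCII cannot represent à or í.
try:
    path.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 always succeeds for a unicode object.
utf8_path = path.encode('utf-8')
```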

Conclusion

Since your add-on should work on all supported operating systems, use the following approach:

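import sys
# 'path' is assumed to be a unicode object
if sys.platform.startswith('win'):
    fs_path = path                  # Windows: pass unicode to file functions
else:
    fs_path = path.encode('utf-8')  # elsewhere: pass UTF-8 encoded strings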
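One way to avoid repeating this check at every call site is to wrap it in a helper. The function name fs_path and its platform parameter are assumptions made for this sketch (the parameter exists only so the behaviour can be checked on any machine); the input is assumed to be a unicode object.

```python
# -*- coding: utf-8 -*-
import sys


def fs_path(path, platform=sys.platform):
    """Return path in the form suggested above for the given platform.

    path is assumed to be a unicode object.
    """
    if platform.startswith('win'):
        return path                   # Windows: keep unicode
    return path.encode('utf-8')       # elsewhere: UTF-8 byte string
```

A file function would then be called as, for example, open(fs_path(path), 'rb').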

Examples

Addon path

The first path an add-on has to deal with is its own add-on path:

 path = addon.getAddonInfo('path').decode('utf-8')

Kodi's getAddonInfo returns a UTF-8 encoded string, which we decode to unicode.

Browse dialog

 dialog = xbmcgui.Dialog()
 directory = dialog.browse(0, 'Title', 'pictures').decode('utf-8')

dialog.browse() returns a UTF-8 encoded string, which may contain non-Latin characters. Therefore decode it to unicode!

Path joins

 os.path.join(path, filename)

If path and filename are both unicode objects, everything will work as expected. But what happens if filename is a UTF-8 encoded string that contains "öäü.jpg"?

When a byte string is joined with a unicode object, Python always produces a unicode result. It therefore decodes the byte string implicitly with its default encoding (ASCII), as if you had written:

 os.path.join(path, filename.decode('ascii'))

Because ö, ä and ü do not exist in ASCII, you will get a UnicodeDecodeError! That is why you must explicitly convert the string to unicode:

 os.path.join(path, filename.decode('utf-8'))
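The two cases above can be checked with a short script. It is a sketch that uses illustrative values only: the directory and file name are taken from the examples on this page, and the byte string is built by hand to stand in for a value returned by Kodi.

```python
# -*- coding: utf-8 -*-
import os

path = u'opt/Kodi'
filename = u'öäü.jpg'.encode('utf-8')   # UTF-8 byte string, as Kodi would return it

# Decoding as ASCII fails because ö, ä and ü are outside the ASCII range.
try:
    filename.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding explicitly as UTF-8 gives a unicode object that joins cleanly.
joined = os.path.join(path, filename.decode('utf-8'))
```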

Logging

"print" and xbmc.log do not support unicode. Always encode unicode strings to UTF-8:

    print message.encode('utf-8')

Alternatively, the following function can be used, where msg can be anything from a byte string to a unicode object to a class instance:

import xbmc

def log(msg, level=xbmc.LOGDEBUG):
    plugin = "My nice plugin"

    # xbmc.log expects a byte string, so encode unicode objects first
    if isinstance(msg, unicode):
        msg = msg.encode('utf-8')

    xbmc.log("[%s] %s" % (plugin, str(msg)), level)

However, it is highly recommended never to mix byte strings and unicode strings in your program, in which case the 'if isinstance' check is unnecessary.
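If mixed input cannot be ruled out, the message can be normalised to unicode first and encoded exactly once at the end. The following is a sketch only: format_log_message is a hypothetical helper, byte strings are assumed to be UTF-8, and the actual xbmc.log call is left out so that the code can be run outside Kodi.

```python
# -*- coding: utf-8 -*-

try:
    text_type = unicode      # Python 2
except NameError:
    text_type = str          # lets the sketch also run on Python 3


def format_log_message(msg, plugin=u'My nice plugin'):
    """Return a UTF-8 encoded log line for a byte string, unicode or other object."""
    if isinstance(msg, bytes):
        msg = msg.decode('utf-8')      # byte strings are assumed to be UTF-8
    elif not isinstance(msg, text_type):
        msg = text_type(msg)           # numbers, class instances, ...
    return (u'[%s] %s' % (plugin, msg)).encode('utf-8')
```

The returned byte string could then be passed on as xbmc.log(format_log_message(msg), level).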

See also

Development: